13 min readJun 13, 2020

Identifying Dogs using CNN Transfer Learning

Project overview

In this project, we build and train a neural network model with CNN (Convolutional Neural Networks) transfer learning, using 8,351 dog images of 133 breeds. CNN is a type of deep neural networks, which is commonly used to analyze image data.

Typically, a CNN architecture consists of convolutional layers, activation function, pooling layers, fully connected layers, and normalization layers. Transfer learning is a technique that allows a model developed for a task to be reused as the starting point for another task.

The trained model in this project can be used by a web or mobile application to process real-world, user-supplied images. Given an image of a dog, the algorithm will predict the breed of the dog. If an image of a human is supplied, the code will identify the most resembling dog breed of that person.

This article will describe the technical aspect of this project from start to finish.

Project Statement

Given an image of a dog, identify its breed. You must design and implement an optimal algorithm that would detect a human or a dog in a picture. If the algorithm detects a dog, identify its breed. If it detects a human, provide an estimate of the dog breed that is most resembling. If neither is present in the image, exit out gracefully. The image below displays a potential sample output.

Sample output of the algorithm — Sample output for the algorithm

Metrics

The notebook dog_app.ipynb is divided into seven distinct steps each with its own criteria. Well, there is a step 0 as well making up to be a total of eight steps but that only implements loading of the dataset. So, we will not count that on into our metrics.

Step 1: Detect Humans

We use OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. OpenCV provides many pre-trained face detectors, stored as XML files on GitHub. This would be used in a function that would return True or False when a human face is detected in the image.

Using this function, we check What percentage of the first 100 images in dog and human files have a detected human face. Ideally, we would like 100% of human images with a detected face and 0% of dog images with a detected face.

Step 2: Detect Dogs

We use a pre-trained ResNet-50 model to detect dogs in images. The code downloads the ResNet-50 model, along with weights that have been trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks.

The notebook provides some pre-processing steps to convert the data into proper shape. Like the previous step, we write a function to detect dogs and use it to check what percentage of human and dog files have a detected dog face.

Step 3: Create a CNN to Classify Dog Breeds (from Scratch)

Now that we have functions for detecting humans and dogs in images from the previous steps, we need a way to predict breed from images. In this step, we create a CNN that classifies dog breeds. We don’t use transfer learning yet, this will be done in a further step. The test accuracy of at least 1% (yes! 1%) is set as the bar. Why so low? Read on and you will find out!

Step 4: Use a CNN to Classify Dog Breeds

This is one is given just as a reference purpose. The notebook proposes an approach to reduce training time without sacrificing accuracy. Here too, the bar isn’t set too high for test accuracy (Read on and you will find out what we achieve with this one).

Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)

Now we use transfer learning to create a CNN that can identify dog breed from images. The CNN we create must attain at least 60% accuracy on the test set. Ideally, the accuracy rate is even higher (~90–95% at least).

To make things easier, the notebook provides pre-computed features for the following networks that are currently available in Keras:

VGG-19 bottleneck features
ResNet-50 bottleneck features
Inception bottleneck features
Xception bottleneck features

Using one of these, we create our model.

Step 5: Write your Algorithm

Here, we write an algorithm that accepts a file path to an image and first determines whether the image contains a human, dog, or neither. Then,

if a dog is detected in the image, return the predicted breed.
if a human is detected in the image, return the resembling dog breed.
if neither is detected in the image, provide an output that indicates an error.

We use the model from the previous step. A sample image and output for our algorithm is provided below.

Sample input and output from our custom algorithm

Step 7: Test Your Algorithm

Here the algorithm is put to the test. Several questions can be asked here, like what kind of dog does the algorithm think that you look like? If you have a dog, does it predict your dog’s breed accurately? If you have a cat, does it mistakenly think that your cat is a dog?

Data Analysis

Data Exploration and Visualization

The datasets are already provided to us from these sources: Dog Images and Human Images.

The full dataset used 8,351 images of 133 categories of dogs. The data is separated into three folders for training, validation, and test set. The load_files function from the scikit-learn library is used to import the datasets.

The print statements above give us the exact numbers

There are 133 total dog categories.
There are 8351 total dog images.
There are 6680 training dog images.
There are 835 validation dog images.
There are 836 test dog images

To see the maximum, minimum and average number of dogs per category, the following code does the work

Maximum number of dogs per category is 96.0
Minimum number of dogs per category is 33.0
Average number of dogs per category is 62.79

Visualizing the same gives us the following graph

I chose not to print out the category label because it would hard to read due to their large number, even in the horizontal bar graph.

Methodology

Data Preprocessing

When using TensorFlow as backend, Keras CNNs require a 4D array (which we’ll also refer to as a 4D tensor) as input, with the shape

(nb_samples, rows, columns, channel)

where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.

The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image that is 224 x 224 pixels. Next, the image is converted to an array, which is then resized to a 4D tensor. In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have the shape (1, 224, 224, 3)

The paths_to_tensor function takes a NumPy array of string-valued image paths as input and returns a 4D tensor with shape (nb_samples, 224, 224, 3)

Here, nb_samples is the number of samples, or the number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in the dataset.

Getting the 4D tensor ready for ResNet-50, and for any other pre-trained model in Keras, requires some additional processing. First, the RGB image is converted to BGR by reordering the channels. All pre-trained models have the additional normalization step that the mean pixel (expressed in RGB as [103.939, 116.779, 123.68] and calculated from all pixels in all images in ImageNet) must be subtracted from every pixel in each image. This is implemented in the imported function preprocess_input. If you're curious, you can check the code for preprocess_input here.

Now that we have a way to format our image for supplying to ResNet-50, we are now ready to use the model to extract the predictions. This is accomplished with the predict method, which returns an array whose i-th entry is the model’s predicted probability that the image belongs to the i-th ImageNet category. This is implemented in the ResNet50_predict_labels function below.

By taking the argmax of the predicted probability vector, we obtain an integer corresponding to the model’s predicted object class, which we can identify with an object category through the use of this dictionary.

Implementation

This essentially has Steps 1, 2, and 3 mentioned in the Metrics section above.

In Step 1, we use OpenCV’s Haar feature-based classifier to detect a human face in a sample image.

The output looks something like this

We then use a function that would return True or False if a human face is detected. Sometimes we just wanna know if a human face is present in the image and that’s where this function comes in.

In Step 2, we make use of a pre-trained ResNet50 model from Keras to detect Dogs in images.

Before moving on to the next step, we preprocess the data. The preprocessing details are shared in the preprocessing section above.

While looking at the dictionary, you will notice that the categories corresponding to dogs appear in an uninterrupted sequence and correspond to dictionary keys 151–268, inclusive, to include all categories from 'Chihuahua' to 'Mexican hairless'. Thus, in order to check to see if an image is predicted to contain a dog by the pre-trained ResNet-50 model, we need only check if the ResNet50_predict_labels function above returns a value between 151 and 268 (inclusive).

In Step 3, we use these functions in our CNN built from scratch. Before we start building the model, we rescale the images by dividing every pixel by 255.

For the architecture, I use the example given in the notebook as a reference to build the architecture.

The architecture looks something like this:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 223, 223, 16)      208       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 111, 111, 16)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 110, 110, 32)      2080      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 55, 55, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 54, 54, 64)        8256      
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 27, 27, 64)        0         
_________________________________________________________________
global_average_pooling2d_1 ( (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 133)               8645      
=================================================================
Total params: 19,189
Trainable params: 19,189
Non-trainable params: 0
_________________________________________________________________

Since this is a fairly small model, compiling and training the model doesn't take very long. However, training time depends on the hardware configuration.

Now, the output of the training is very verbose so I won’t post it here.

To test the accuracy of the model, we load the saved model and run it against our test data.

Here, we get a test accuracy of 4.4258%. This would be considered essentially useless but since the bar was only to get an accuracy of >1%, this is fine.

Refinements

This would go through Steps 4 and 5 from the Metrics section above as you will see that our accuracy increases quite a bit!

In Step 4, we train a model using transfer learning. Here, instead of rescaling the image, we extract bottleneck features from the model’s npz file which is provided in the notebook.

The rest of the steps for this and the next Step are similar, so I will skip the code sections for this one as it would look quite repetitive. Also, the entire Step 4 was completed for us in the notebook.

Training this model got us an accuracy of 40.4306%. I use Step 4 as a reference to build Step 5 which is where our accuracy increases a lot.

In Step 5, we create a custom ResNet50 model. Since the features are ready pre-computed and given to us, we use the corresponding features and extract the bottleneck-features.

I kept the architecture for this model pretty simple. Keeping a simple architecture allows for effective classification of the images with relatively low usage of computation power. Also, ResNet50 was chosen among VGG16, VGG19, and Inception due to its highest accuracy amongst the four.

The compilation, training, and testing the model is essentially identical to the one in Step 3.

With this model, we get an accuracy of 82.2967% which is pretty good considering the bar is only to get >60%.

Results

Model Evaluation and Validation

For model evaluation and validation, we work through Steps 6 and 7 where we write our own algorithm and test it.

To evaluate the model, we write a function identify_dog(img_path) that would identify dog breed stored at the image path and return the predicted breed of the dog.

Supplying a sample image would give us the following output:

Sample output for our custom ResNet50 model

Now, to use this function, we write an algorithm which brings us to Step 6. This algorithm would return the possible dog breed of the dog/human in the input image.

Now to test the algorithm(which essentially validates the model), which is Step 7, we use sample images stored in the repo and run by the algorithm one by one.

Justification

The output of the algorithm looks quite promising. Since the output is quite long, I’ll post a part of it that shows all three scenarios. You can check out the full output on the repo.

The algorithm outputs no human or dog detected in the cat picture as expected, Alaskan Malamute for the picture above which is correct and says that I look like Silky terrier…😅

Do I look like a silky terrier? 😄 Maybe not. But since the accuracy is just around 82%, I would expect too much.

Conclusion

The output is pretty good actually. The algorithm accurately reports the exact or the nearest breed it guesses. It was also able to tell apart dogs from humans and report errors when no dog or human detected.

Reflection and Improvements

In this project, we went through seven steps in testing and creation of an algorithm that would help us identify the breed of dogs and even tell a resembling breed when a human image is provided.

We first create basic functions that would tell humans and dogs apart. These functions would use Haar feature-based classifier from OpenCV and pre-trained ResNet50 model from Keras respectively. Using the ResNet50 model familiarized us with the preprocessing steps that are essential in getting the input in proper shape.

Then we created our own CNN model from scratch by preprocessing the images, defining, and architecting the model. Compiling, training, and testing the model is done afterward. These three steps are the same for every model we use here. Do note that we did not use transfer learning yet.

We then go through the process of training a model with transfer learning. Here the notebook helps us by giving an example of the whole process i.e. defining, architecting, compiling, training, and testing a model. Using this as a reference, we create our own ResNet50 model using the transfer learning technique. We choose ResNet50 for its accuracy, low computational needs, and simple architecture. This got us an accuracy of around 82% which is pretty good.

The most interesting and difficult part of this was to choose and architecting a model during transfer learning. We are given pre-computed features for various models. Each model can be architected in n number of ways from simple to extremely complex using these features and it would give us a big range of accuracies for each model. Keeping the architecture simple and using low computation power without sacrificing accuracy is a real challenge. We can go really deep in this and know which model to use, define, architect would require a lot of time and practice, which could come with experience.

The accuracy bar here was only >60% and we got ~82%. There are many ways in which this could be improved even further:

Use data augmentation techniques that consisted of image translations, horizontal reflections, and mean subtraction.
Add the noise reduction steps in image pre-processing.
Choose the face or detected dog part in the image, and only apply the selected part of the image to the algorithm. In this way, all the background will not affect the prediction.
Detecting multiple dogs and there breeds. Same when there are multiple humans.
When there are human(s) and dog(s) in the same picture, detect the dog’s breed(s) and resembling dog breed(s) for the human(s).
Using larger datasets for better accuracy.

References

I used a lot of references while working on this project. Many of which I don't even remember. Although, some of them are as follows:

Credits must be given to Udacity for the starter codes and pre-computed files used by this project.
The outline of this article was an inspiration from this one: https://medium.com/@gopal.iyer0/robot-motion-planning-dsnd-capstone-project-234252e608b9
Stackoverflow questions (many of them) and github issues including https://stackoverflow.com/questions/51231576/tensorflow-keras-expected-global-average-pooling2d-1-input-to-have-shape-1-1, https://github.com/keras-team/keras-applications/issues/167
Course materials gave the idea for model architectures.
Keras library documentation for ResNet50, VGG19, and other functions

You can check out the code here: https://github.com/agpt8/identify-dogs