Dog Breed Identification: Whitney Larow Brian Mittl Vijay Singh
SIFT greyscale descriptors is used over each eye and nose.
After the eyes and nose have been detected, greyscale SIFT
descriptors around the keypoints are used as features by an
SVM classifier. With this approach, Liu et al. are able to
classify their test dataset with an accuracy of about 90%.
2.2. Comparison
Liu et al.'s results are certainly very impressive. However, significant effort is put into identifying the dog's face and then its keypoints. In fact, Liu et al. identify the failure of dog face detection as the primary bottleneck in their pipeline.
In our work, we attempt to use a convolutional neural network to assist with keypoint detection in dogs, namely identifying eyes, nose, and ears. CNNs have seen success in identifying facial keypoints in humans, and we hope to apply this technique to dogs as well [4]. By doing so, we eliminate dog face detection as a step in the process and replace Liu et al.'s part localization process.
For classification, we utilize multinomial logistic regression and nearest neighbor models, neither of which is used by Liu et al. With our own novel approach, we hope to match the success of Liu et al.

3. Technical Solution
3.1. Summary
To solve this fine-grained classification problem, we developed the analysis pipeline seen in figure 1. First, we trained a convolutional neural network on images of dogs and their annotated facial keypoints (right eye, left eye, nose, right ear tip, right ear base, head top, left ear base, and left ear tip). We then used this network to predict keypoints on an unseen test set of dog images. These predicted keypoints were fed into a feature extraction system that used them to create more meaningful features from the image, which could later be used to classify the image. The primary features extracted were grayscale SIFT descriptors centered around each eye, the nose, and the center of the face (the average of these 3 points). These features were then used for a variety of classification algorithms, namely bag of words, K-nearest neighbors, logistic regression, and SVM classifiers. These algorithms classify the images as one of 133 possible dog breeds in our dataset and complete the analysis pipeline.

Figure 1. An image representation of our analysis pipeline

ear tip. Thus, the goal of this part of the analysis pipeline is defined as follows: given an unseen image from the testing set, predict the keypoints of the dog face as close as possible to the ground truth points in terms of pixels. Convolutional neural networks have been shown to perform quite well on a variety of image detection and classification tasks, including human face detection, but previous literature on dog face detection had not used neural networks; hence, we decided to pursue a novel approach for dog face keypoint detection. To solve this problem, we trained a fully connected convolutional neural network on 4,776 training images, which was used to predict the keypoints of 3,575 testing images. Before training, the images were all scaled to be 128×128 pixels; furthermore, the pixel intensity values were scaled to be in the range [0,1]. The ground truth keypoints were also scaled accordingly. The neural network was constructed using the nolearn lasagne API [10]. It performed regression using a mean squared error loss function defined as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$

The network was trained using batches of 180 images for 4,000 epochs. In each batch, half of the images and their corresponding keypoints were flipped. This augmentation created a larger training set, which helped prevent the network from overfitting the data given. The architecture of the network went through several iterations, and the final construction is shown in figure 2.

3.2.2 Feature Extraction
Once we had detected the facial keypoints, we used them to extract meaningful features about the image, which we would later use in classification.
Figure 2. The architecture of our convolutional neural network
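The intensity scaling, flip augmentation, and mean squared error loss used for the keypoint network can be sketched in NumPy as below. This is only an illustration under stated assumptions, not the nolearn/lasagne code used in this work: we assume the flip is horizontal, that keypoints are stored as (x, y) pixel coordinates in the order listed in section 3.1, and that left/right labels are swapped after mirroring. The names `SWAP_PAIRS`, `scale_intensities`, `flip_half_batch`, and `mse` are hypothetical.

```python
import numpy as np

IMAGE_SIZE = 128  # images are scaled to 128x128 pixels in the text

# Assumed keypoint order (taken from the list in the text):
# 0 right eye, 1 left eye, 2 nose, 3 right ear tip, 4 right ear base,
# 5 head top, 6 left ear base, 7 left ear tip
# Left/right pairs whose labels swap under a horizontal flip (assumption).
SWAP_PAIRS = [(0, 1), (3, 7), (4, 6)]


def scale_intensities(images_uint8):
    """Map 8-bit pixel values to floats in the range [0, 1]."""
    return images_uint8.astype(np.float32) / 255.0


def flip_half_batch(images, keypoints, rng):
    """Horizontally flip a random half of a batch together with its keypoints.

    images:    array of shape (N, H, W) or (N, H, W, C)
    keypoints: array of shape (N, 8, 2) holding (x, y) pixel coordinates
    Returns the augmented copies and the indices that were flipped.
    """
    images = images.copy()
    keypoints = keypoints.copy()
    n = images.shape[0]
    chosen = rng.choice(n, size=n // 2, replace=False)

    # Reverse the width axis of the chosen images.
    images[chosen] = images[chosen][:, :, ::-1]

    flipped = keypoints[chosen]
    # Mirror the x coordinate about the vertical centre line of the image.
    flipped[:, :, 0] = (images.shape[2] - 1) - flipped[:, :, 0]
    # Swap left/right labels so that index 0 still means "right eye", etc.
    for a, b in SWAP_PAIRS:
        flipped[:, [a, b]] = flipped[:, [b, a]]
    keypoints[chosen] = flipped
    return images, keypoints, chosen


def mse(predicted, target):
    """Mean squared error over all predicted coordinates."""
    return float(np.mean((predicted - target) ** 2))
```

Swapping the left/right labels matters because, after mirroring, the dog's left eye sits where a right eye would be expected; without the swap the regression targets for flipped images would be inconsistent with the unflipped ones.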
at the left and right eyes, the nose, and the center of the face (calculated as the center of these three points). We rotated our SIFT descriptors to match the rotation of the dog's face (calculated as the rotation away from horizontal for the line connecting the two eyes). We also sized our descriptors to be half the distance between the two eyes. We used Python's cv2 library from OpenCV to calculate these SIFT descriptors.

RGB Histogram
Image color histograms are frequently used in image search engines. Given how dog coloring can vary, we believed it was worthwhile to explore using a color histogram as features for our classifier. To do so, we assumed an RGB color space, with pixel intensity values ranging from 0 to 255. As such, we had 256 bins (one for each pixel value), and three different histograms (one for each RGB channel). We limited the color histogram to the face of the dog to ensure that the background would not interfere. In total, this produced 768 features (256 bins × 3 channels).

Figure 3. An American Staffordshire Terrier from our dataset and its corresponding color histogram (divided into RGB channels).

Color Centers Histogram
The next color histogram we tried implementing used k-means to calculate the 32 color centers of all pixels over all training images (see figure 4). The feature we then extracted for each image was a histogram of pixel values where each bin corresponded to a color center and contained all pixels that were closest to that color center.
We realized we were getting a lot of noise from background colors, which should have no effect in determining a dog's breed, so we later refined this algorithm to consider only the pixels within the bounding box of the dog's face (calculated using the facial keypoints). We re-calculated the 32 color centers over only the pixels contained in the face (see figure 4), and re-calculated our histograms to also cover only those pixels within the bounding box of the dog's face.

Figure 4. Color centers calculated using k-means over the entire image (top) and just the dog's face (bottom) on the training set. The size of the bar represents how many pixels in the training set belong to that color center.

3.2.3 Classification

Bag of Words Model
We decided to use a bag of words model for our baseline because it is a simple model that is frequently used for object classification. The bag of words model ignores our detected facial keypoints and instead just finds a visual vocabulary based on the cluster centers of SIFT descriptors obtained from the training set of images. Because bag of words models perform better classification for objects with very high inter-class variability, we did not expect it to work very well for our fine-grained classification problem.
For our bag of words model, we first used OpenCV's FeatureDetector and DescriptorExtractor to extract SIFT descriptors. We then used the K-means algorithm, from Python's scipy library, to extract a visual vocabulary. We used this visual vocabulary to create feature histograms, preprocessed and scaled these histograms to fit a Gaussian distribution, then used them to train an sklearn LinearSVC model.

SVM
The first real classification model we tried was an SVM. SVMs are commonly used machine learning models that perform well at multi-class classification. To date, few other supervised learning algorithms have outperformed SVMs, which is why we thought this would be a good place to start.
We tried a few different types of SVMs in order to determine which worked best for our use case. The first we used was the same LinearSVC model from sklearn that we used for our bag of words model. We then tried a normal SVM with a linear kernel and a OneVsRestClassifier, all using the same sklearn library for consistency. All SVMs used the grayscale SIFT features centered at facial keypoints and the color centers histogram calculated around the dog's face. In our experimental results section, you can see that the normal SVM with linear kernel outperformed the other two SVMs while keeping the features consistent.

Logistic Regression
As we learned in class, linear classifiers are commonly used discriminative classifiers for categorizing images. Unlike a standard linear regression problem, we have a finite number of discrete values (the various possible dog breeds) that our predicted value y can take on. To make our prediction, we would use the following hypothesis function:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Note, however, that the standard application of the sigmoid function for logistic regression is used for binary classification (where y can take on only two discrete values). As we have 133 possible breeds, we extend the above logic to handle multiple classes, also known as multinomial logistic regression.
Instead of having a single logistic regression equation, we have J − 1 = 132 equations, where J represents the total number of breeds (we only need J − 1 since one of the breeds will serve as a reference). Thus, the log-odds of dog i being breed j, relative to the reference breed J, is:

$$\log \frac{P(i = j)}{P(i = J)} = \alpha_j + x_i^T \beta_j$$

where α_j is a constant, x_i is dog i's feature vector, and β_j is a vector of regression coefficients of breed j, for j = 1, 2, ..., J − 1.
Our model is analogous to the standard logistic regression model, except that the probability distribution of the response is multinomial instead of binomial; we use standard maximum likelihood estimation to fit the coefficients and then predict the most likely breed. While such applications of logistic regression have been used before in fine-grained classification, our literature review showed no prior use in dog breed classification specifically. To implement this, we used scikit-learn's linear_model module in Python.

K-Nearest Neighbors
Finally, we also attempted a nonlinear discriminative classifier using a K-nearest neighbor classifier. A dog is classified by a majority vote of its neighbors, with the dog being assigned to the breed most common among its k nearest neighbors. As noted in lecture, this method depends on the training set being large enough to generate enough meaningful votes and does not always produce the optimal results. As such, we anticipated one of the linear classifiers performing better.

4. Experiments

Dataset
For our experiments, we used the Columbia Dogs Dataset as it provided the most robust data available online. The dataset contains 8,350 images of 133 different dog breeds, some of which are featured in figure 5. Each image had a corresponding text file that annotated both bounding boxes and keypoints for each dog face. The facial keypoints annotated were the right eye, left eye, nose, right ear tip, right ear base, head top, left ear base, and left ear tip, which we used to train our convolutional neural network for keypoint detection.

Keypoint Detection
Qualitatively, the results of the keypoint prediction can be seen in the images in figure 6, where the red crosses are the predicted points and the green crosses are the ground truth points. Quantitatively, we evaluated keypoint detection based on the average distance between the ground truth keypoint and its predicted counterpart. On average, the neural network predicted the keypoint to be 4.62 pixels from the ground truth point.
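The average-distance metric used to evaluate keypoint detection can be sketched as follows. This is a minimal illustration, assuming the distance is Euclidean and is averaged over every keypoint of every test image; the text says only "average distance", so both choices are assumptions, and the function name `mean_keypoint_distance` is hypothetical.

```python
import numpy as np


def mean_keypoint_distance(predicted, ground_truth):
    """Average Euclidean pixel distance between matching keypoints.

    predicted, ground_truth: arrays of shape (N, K, 2) holding (x, y)
    pixel coordinates for N images with K keypoints each.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # One distance per keypoint per image, shape (N, K).
    distances = np.linalg.norm(predicted - ground_truth, axis=-1)
    return float(distances.mean())


# Toy example: one image with two keypoints (values are made up).
pred = [[[3.0, 4.0], [10.0, 10.0]]]
truth = [[[0.0, 0.0], [10.0, 12.0]]]
print(mean_keypoint_distance(pred, truth))  # distances 5 and 2, mean 3.5
```

Because the coordinates are in pixels of the 128×128 scaled images, a value such as 4.62 should be read relative to that image size.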
Figure 8. Examples of dog breeds where the convolutional neural
network performed poorly at detecting the ear tips.
Figure 12. Brown and yellow Labrador retrievers, which should be
classified as the same breed and are confounding our model when
we add color-dependent features.
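As a rough illustration of why color-dependent features can separate dogs of the same breed, the RGB histogram feature described in section 3.2.2 (256 bins for each of 3 channels, 768 values in total) can be sketched with NumPy. The function name `rgb_histogram` and the two solid-color patches are hypothetical stand-ins for cropped dog faces, and whether the histograms in this work were normalized is not stated, so raw counts are an assumption.

```python
import numpy as np


def rgb_histogram(face_pixels):
    """Build a 768-long feature vector from an (H, W, 3) uint8 RGB crop.

    Each channel contributes 256 bins, one per intensity value 0..255.
    """
    per_channel = []
    for c in range(3):
        values = face_pixels[..., c].ravel().astype(np.intp)
        per_channel.append(np.bincount(values, minlength=256))
    return np.concatenate(per_channel)


# Two made-up 4x4 patches standing in for a brown and a yellow coat.
brown = np.full((4, 4, 3), (139, 69, 19), dtype=np.uint8)
yellow = np.full((4, 4, 3), (240, 220, 130), dtype=np.uint8)

brown_hist = rgb_histogram(brown)
yellow_hist = rgb_histogram(yellow)

# The two histograms occupy entirely different bins, so their overlap is 0.
overlap = int(np.dot(brown_hist, yellow_hist))
```

Two patches of the same breed but different coat color share no histogram bins here, which is the kind of within-breed variation that can confound a classifier relying on color features.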
point detection according to the literature. Our keypoint predictions were 4.62 pixels from the ground truth points on average. The accuracy of our keypoint detection allowed us to succeed with our classification algorithms and is promising for future work in the area. Our extensive experimentation with classification algorithms also provides novel contributions to the literature. The use of color histograms in feature extraction for dog breed identification has not been done before; however, we found them to be unsuccessful due to the variety in color for individuals of the same breed. We also experimented with a variety of classifiers such as logistic regression and K-nearest neighbors for predicting dog breeds. These contributions add a variety of techniques to the literature for dog breed identification, some of which should be explored further, as will be discussed next.

5.2. Future Work
Future work should further explore the potential of convolutional neural networks in dog breed prediction. Given the success of our keypoint detection network, this is a promising technique for future projects. That said, neural networks take an enormous amount of time to train, and we were unable to perform many iterations on our technique due to time constraints. We recommend further exploration into neural networks for keypoint detection, specifically by training networks with a different architecture and batch iterator to see what approaches might have greater success. Also, given our success with neural networks and keypoint detection, we recommend implementing a neural network for breed classification as well, since this has not been performed in the literature. We were unable to experiment with this approach due to the time required to train neural networks, but we believe that they would match if not improve upon our classification results. Ultimately, neural networks are time consuming to train and iterate upon, which should be kept in consideration for future efforts; still, neural networks are formidable classifiers that we expect to increase prediction accuracy over more traditional techniques.

References
[1] A. Angelova and S. Zhu. Efficient object detection and segmentation for fine-grained recognition. 2013 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[2] J. L. et al. Dog breed classification using part localization. Computer Vision: ECCV 2012, pages 172–185, 2012.
[3] N. Z. et al. Deformable part descriptors for fine-grained recognition and attribute prediction. 2013 IEEE International Conference on Computer Vision, 2013.
[4] N. Z. et al. Part-based R-CNNs for fine-grained category detection. Computer Vision: ECCV 2014, pages 834–849, 2014.
[5] O. M. P. et al. Cats and dogs. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[6] P. N. B. et al. Searching the world's herbaria: A system for visual identification of plant species. 2008.
[7] P. N. B. et al. Localizing parts of faces using a consensus of exemplars, 2013.
[8] R. F. et al. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. Pattern Recognition: 36th German Conference, GCPR 2014, 2014.
[9] E. Gavves. Fine-grained categorization by alignments, 2013.
[10] D. Nouri. Using convolutional neural nets to detect facial keypoints tutorial, 2014.