After the emergence of AlexNet in 2012, Convolutional Neural
Networks (ConvNets) became the most effective and widely used method for image recognition
tasks, and proved far superior to traditional image processing
techniques. ConvNets have shown remarkable performance on image
classification tasks, in which, given an input image and a set of categories, the
network decides which category is most strongly present in the image.
Convolutional neural network architectures can readily be
trained to classify images. However, classifying whole images is not enough for the
task of object detection, in which each object in the
image has to be both classified and localized. Object detection therefore
requires an additional algorithm on top of the ConvNet. This section serves as an
introduction to the algorithms frequently used on top of ConvNets to detect,
localize and classify objects in images, along with a detailed discussion of
the algorithm we selected for our task, the Single Shot Multibox Detector (SSD).
One of the first techniques developed by researchers
to deal with the tasks of object detection, localization and
classification was R-CNN [33]. An R-CNN is a special type of CNN that
has the ability to locate and detect objects in images. The goal of R-CNN is to
take an image as input and correctly identify where the main
objects are in that image, marking each with a bounding box.
The image below shows the output of a typical R-CNN:
Figure 2.2?5: An example output of R-CNN.
Image Source: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab
How does R-CNN find out where to place the bounding
boxes? It proposes a large number of candidate boxes (region proposals) in the image,
using a procedure called Selective Search, and checks
whether any of them correspond to an object. Once the region proposals
have been generated, the image region inside each box is passed through
a pre-trained AlexNet model to extract features, and then through a Support Vector Machine (SVM),
which classifies the region into one of the given classes. Once the
object has been classified, the bounding box is run through a linear regression
model that tightens it around the object. R-CNN works
reasonably well as far as the accuracy of the bounding boxes is concerned, but it
is quite slow, as it requires a forward pass of the CNN for every single region
proposal in each image (~2000 region proposals per image). It is also
hard to train, as it requires three different models to be trained separately:
the CNN which generates the features for every region, the classifier which predicts
the class, and the linear regression model which tightens the bounding boxes.
Figure 2.2?5: R-CNN Workflow. Image Source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
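The bounding box tightening step can be illustrated with a minimal sketch. It assumes the commonly used parameterization in which the regressor outputs four offsets (tx, ty, tw, th): the first two shift the box center in proportion to the box size, and the last two rescale width and height on a log scale. The function name refine_box and the corner-based box format are hypothetical and chosen only for illustration.

```python
import math

def refine_box(box, deltas):
    """Apply regression offsets (tx, ty, tw, th) to a proposal box.

    box is (x_min, y_min, x_max, y_max); the refined box is returned
    in the same format. Assumed parameterization, for illustration only.
    """
    x_min, y_min, x_max, y_max = box
    tx, ty, tw, th = deltas

    # Convert the corner format to center, width and height.
    w = x_max - x_min
    h = y_max - y_min
    cx = x_min + 0.5 * w
    cy = y_min + 0.5 * h

    # Shift the center proportionally to the box size,
    # and rescale width and height on a log scale.
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    new_w = w * math.exp(tw)
    new_h = h * math.exp(th)

    # Convert back to the corner format.
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)
```

With all offsets equal to zero the proposal is returned unchanged; a negative tw or th shrinks the box around its center, which is the "tightening" described above.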
The problems stated above were addressed by the introduction of
Fast R-CNN [34]. Fast R-CNN built on the
previous work to classify object proposals much more efficiently. The
key idea which makes Fast R-CNN faster is a technique known as Region of
Interest (RoI) Pooling. It works by running the CNN before the region
proposals are cropped rather than after: the whole image is first passed
through the CNN once, and the features of each region proposal are then pooled from the last
feature map of the CNN. In addition, in Fast R-CNN the CNN, classifier and bounding
box regressor are trained jointly in a single network, whereas R-CNN used three separate
models to extract image features, classify, and tighten the bounding
boxes. As a result,
Fast R-CNN is significantly faster than R-CNN.
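The idea behind RoI Pooling can be sketched in a few lines of NumPy. This is a simplified illustration, not the Fast R-CNN implementation: it assumes a single-channel feature map, a region given directly in feature-map cells, and a region at least as large as the output grid. The function name roi_max_pool and its parameters are hypothetical.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one region of a 2-D feature map to a fixed-size grid.

    roi is (row_start, col_start, row_end, col_end) in feature-map cells,
    with exclusive end indices. The region is assumed to span at least
    output_size cells in each dimension.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]

    # Split the region into output_size x output_size bins.
    row_edges = np.linspace(0, region.shape[0], output_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], output_size + 1).astype(int)

    pooled = np.empty((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()  # keep the strongest activation per bin
    return pooled
```

Because the feature map is computed once for the whole image, regions of different sizes can all be pooled from it to the same fixed output size, without a separate forward pass per proposal.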
Faster R-CNN [25] further improved on the speed
of the previous techniques by addressing one of the remaining
bottlenecks, the region proposer. It speeds up region proposal by
inserting a region proposal network (RPN) after the last convolutional layer, so that
region proposals are produced just by looking at the last convolutional feature
map. From there onwards, the same pipeline is used as in Fast R-CNN.
Figure 2.2?5: Faster R-CNN Workflow. Image
Single Shot Multibox Detector (SSD)
We now present the object detection and localization technique
that we use for our task of drowsiness detection. The technique is known as the
Single Shot Multibox Detector (SSD) [35], and its authors report that it is
much faster than methods that rely on a separate region proposal step while remaining
comparably accurate on object detection tasks. To begin our understanding
of SSD, we start with an explanation of the name:
Single Shot: This refers to the fact that object detection and localization are
done in a single forward pass of the network.
MultiBox: This is the name of the technique developed by the authors for the
task of bounding box regression (i.e. making bounding boxes tighter).
Detector: The network is an object detector which also classifies the detected objects.
Figure 2.2?5: SSD architecture. Image
As shown in the figure above, SSD builds
on the VGG-16 architecture, but does away with the fully
connected layers. VGG-16 is used as the base network because it performs very
strongly in image classification tasks, and it is used very widely for transfer
learning.
MultiBox is a bounding box regression technique developed by the
authors of the paper. In MultiBox, the researchers use “priors”, which are
pre-computed, fixed-size bounding boxes that closely match the distribution of
the original ground truth boxes. These priors are selected in such a way that
their Intersection over Union (IoU) ratio with the ground truth is greater than 0.5. MultiBox starts
with the priors as predictions and attempts to regress them closer to the ground
truth bounding boxes.
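The IoU criterion used to select priors can be made concrete with a short sketch. Boxes are assumed to be given as (x_min, y_min, x_max, y_max) tuples; the helper names iou and match_priors are hypothetical and used only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def match_priors(priors, ground_truth, threshold=0.5):
    """Keep the priors whose IoU with the ground truth box exceeds the threshold."""
    return [p for p in priors if iou(p, ground_truth) > threshold]
```

For example, a prior identical to the ground truth box has an IoU of 1, while a prior that does not touch it has an IoU of 0 and is discarded by the 0.5 threshold.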
The resulting architecture contains 11 priors per cell on each of the
8×8, 6×6, 4×4, 3×3 and 2×2 feature maps and only one on the 1×1 feature map,
for a total of 1420 priors per image. This enables robust coverage of
input images at multiple scales, so that objects of various sizes can be detected.
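The total of 1420 priors follows directly from the feature map sizes given above, as the following short calculation shows. The function name count_priors is hypothetical.

```python
def count_priors(feature_map_sizes=(8, 6, 4, 3, 2), priors_per_cell=11):
    """Count priors: 11 per cell on each listed map, plus one on the 1x1 map."""
    cells = sum(size * size for size in feature_map_sizes)  # 64 + 36 + 16 + 9 + 4
    return cells * priors_per_cell + 1

total_priors = count_priors()
```

The five larger feature maps contain 129 cells in total, giving 129 × 11 = 1419 priors, and the single prior on the 1×1 map brings the total to 1420.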
At the end, MultiBox only retains the top K predictions that
have minimised both location (LOC) and confidence (CONF) losses.
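The final selection step can be sketched as a simple top-K filter. This sketch assumes that each prediction carries a single confidence score and that predictions are ranked by that score alone; the actual ranking criterion and the value of K are not specified here, and the function name keep_top_k is hypothetical.

```python
import heapq

def keep_top_k(predictions, k):
    """Return the k predictions with the highest confidence, best first.

    predictions is a list of (confidence, box) tuples. Ranking by
    confidence alone is an assumption made for this illustration.
    """
    return heapq.nlargest(k, predictions, key=lambda p: p[0])
```

For instance, given three predictions with confidences 0.2, 0.9 and 0.5 and K = 2, only the predictions scored 0.9 and 0.5 are retained.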