Complete guide to Object Detection using Deep Learning

Nov 30, 2020

Object Detection

Object detection recognizes and localizes multiple instances of objects in an image or a video. It gives a fuller understanding of the scene than standard image classification, which only assigns a label to the image as a whole. Detection is used to count the object instances in a frame and, at the same time, mark their locations and label them. The performance of these techniques has improved significantly over the past few years, making real-time applications practical. In short, object detection answers the basic question: “What objects are present, where are they, and how many of them are there in the given frame?”

History of Object Detection

Over the past 20–30 years, progress in the field of object detection has gone through two significant development periods, starting from the early 2000s:

1. Traditional object detection (old school): the early 2000s to 2014.
2. Deep learning-based detection: after 2014.

Object detection techniques began to develop in the early 2000s. Detectors of that period relied on low-level and mid-level vision and followed the ‘recognition-by-components’ approach. These methods framed object detection as a measurement of similarity between object components, shapes, and contours, with features based on distance transforms, shape contexts, edgelets, and so on.

Multi-scale detection, that is, detecting objects of “different sizes” and “different aspect ratios” in a single frame, was one of the main technical challenges in the early phases of object detection. After 2014, deep learning techniques largely overcame these limitations.

Concept

The core concept behind object detection is that every object has its own features, and we aim to capture them. These features help separate different objects from each other. Similar concepts are used for tasks such as face detection and fingerprint detection.

For example, if there are two cars on the road, we can use an object detection algorithm to locate, classify, and label each of them.
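As a rough illustration of what a detector's output could look like for the two-car example, the sketch below uses a hypothetical `Detection` record and a hypothetical `summarize` helper. The names, scores, and pixel coordinates are made up for illustration and do not come from any particular library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # what the object is
    score: float  # detector confidence in [0, 1]
    box: tuple    # where it is: (x_min, y_min, x_max, y_max) in pixels


def summarize(detections, min_score=0.5):
    """Count detections per label, ignoring low-confidence ones."""
    kept = [d for d in detections if d.score >= min_score]
    return Counter(d.label for d in kept)


# Made-up output for a frame showing two cars on a road.
detections = [
    Detection("car", 0.92, (34, 120, 210, 260)),
    Detection("car", 0.88, (250, 130, 430, 270)),
    Detection("person", 0.30, (5, 5, 20, 60)),  # below the threshold
]
counts = summarize(detections)
```

Each record answers "what" (the label) and "where" (the box), and the counter answers "how many".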

Methods for Object Detection

Object detection can be achieved using both machine learning and deep learning approaches. The machine learning approach requires the features to be defined by hand first, and then uses a classifier such as a Support Vector Machine (SVM) to do the classification. The deep learning approach, in contrast, learns the whole detection process end to end without the features being explicitly defined. Deep learning detectors are typically built on Convolutional Neural Networks (CNNs).
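To make the classical two-step recipe concrete, here is a minimal sketch: a hand-defined feature (a small histogram of gradient orientations) followed by a linear decision function of the kind a linear SVM would learn. The function names, the 4-bin feature, and the weights are assumptions chosen for illustration. In practice the weights would be learned from labeled data rather than written by hand.

```python
import math


def orientation_histogram(patch, bins=4):
    """Hand-crafted feature: magnitude-weighted histogram of gradient orientations."""
    hist = [0.0] * bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            idx = min(int(angle / math.pi * bins), bins - 1)
            hist[idx] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def linear_score(features, weights, bias):
    """Linear decision function w.x + b; a positive score means 'object'."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Toy patches: intensity changing left-to-right vs. top-to-bottom.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]

# Made-up weights that favor the first orientation bin (not learned).
weights, bias = [1.0, -1.0, -1.0, -1.0], -0.5
score_vertical = linear_score(orientation_histogram(vertical_edge), weights, bias)
score_horizontal = linear_score(orientation_histogram(horizontal_edge), weights, bias)
```

The point of the sketch is the division of labor: the feature is designed by a person, and only the classifier on top of it is trained.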

Deep Learning Methods

Region Proposals (R-CNN, Fast R-CNN, Faster R-CNN)
You Only Look Once (YOLO)

Region-based Convolutional Neural Networks (R-CNN)

There are several object detection models under the umbrella of R-CNN. All of them are built on the same core idea of region proposals. The models have been improved significantly over time in terms of both accuracy and efficiency.

The different models under R-CNN are:

R-CNN

The R-CNN method uses a process called selective search to propose candidate object regions in the image/frame. Selective search generates a large number of candidate regions by grouping similar neighboring regions. Each proposed region is then passed through a CNN and classified to check whether it contains an object. The success of this method depends on the accuracy of that classification.
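The propose-then-classify loop can be sketched as follows. This is not selective search: `propose_regions` is a simple dense-window stand-in, and `toy_classifier` is a hypothetical brightness rule standing in for the CNN. Both are assumptions used only to show the control flow, in which the classifier runs once per proposed region.

```python
def propose_regions(width, height, sizes=(32, 64), stride=32):
    """Stand-in for selective search: dense square windows at a few sizes."""
    boxes = []
    for size in sizes:
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                boxes.append((x, y, x + size, y + size))
    return boxes


def rcnn_style_detect(image, classify_region, threshold=0.5):
    """Classify every proposed region separately and keep the confident ones."""
    height, width = len(image), len(image[0])
    detections = []
    for box in propose_regions(width, height):
        x0, y0, x1, y1 = box
        crop = [row[x0:x1] for row in image[y0:y1]]
        label, score = classify_region(crop)  # one classifier pass per region
        if label != "background" and score >= threshold:
            detections.append((label, score, box))
    return detections


def toy_classifier(crop):
    """Hypothetical classifier: calls a region 'object' if it is mostly bright."""
    pixels = [p for row in crop for p in row]
    mean = sum(pixels) / len(pixels)
    return ("object", mean) if mean > 0.5 else ("background", 1.0 - mean)


# 64x64 dark image with a bright 32x32 square in the top-left corner.
image = [[1 if (x < 32 and y < 32) else 0 for x in range(64)] for y in range(64)]
detections = rcnn_style_detect(image, toy_classifier)
```

Because the classifier is called once for every region, the cost grows with the number of proposals, which is the main reason the original R-CNN is slow.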

Fast R-CNN

The Fast R-CNN method combines the structure of R-CNN with an idea from SPP-net (Spatial Pyramid Pooling) to make the slow R-CNN model faster. Fast R-CNN computes the CNN representation of the whole image in a single pass. It then reuses this shared representation to obtain the CNN representation of every region proposed by selective search, instead of running the CNN once per region. Fast R-CNN also allows the model to be trained end to end.
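The reuse of one shared feature map can be sketched with a simplified region-of-interest max pooling step, which turns a region of any size into a fixed-size grid. The function names and the tiny 4x4 "feature map" below are assumptions for illustration. A real implementation works on multi-channel tensors and maps image coordinates onto feature-map coordinates first.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool a region of a 2D feature map into an out_size x out_size grid.

    roi is (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    height, width = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        ya = y0 + (i * height) // out_size
        yb = y0 + ceil_div((i + 1) * height, out_size)
        row = []
        for j in range(out_size):
            xa = x0 + (j * width) // out_size
            xb = x0 + ceil_div((j + 1) * width, out_size)
            row.append(max(feature_map[y][x]
                           for y in range(ya, yb)
                           for x in range(xa, xb)))
        pooled.append(row)
    return pooled


# Computed once for the whole image, then shared by every region.
feature_map = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
full_region = roi_max_pool(feature_map, (0, 0, 4, 4))
sub_region = roi_max_pool(feature_map, (1, 1, 4, 4))
```

Both regions produce a 2x2 output even though they differ in size, so the same downstream layers can process every proposal.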

The Fast R-CNN model also incorporates bounding box regression into the training process. This merges localization and classification into a single process, making the entire pipeline faster.
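Bounding box regression is commonly set up so that the network predicts offsets that move a proposal onto the true box. A minimal sketch of that encoding, using the center/size parameterization from the R-CNN family of papers, is shown below. The helper names `to_center`, `encode`, and `decode` and the example boxes are assumptions for illustration.

```python
import math


def to_center(box):
    """Convert (x0, y0, x1, y1) to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return x0 + w / 2, y0 + h / 2, w, h


def encode(proposal, target):
    """Regression targets that would move `proposal` onto `target`."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(target)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, deltas):
    """Apply predicted deltas to a proposal to get the refined box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    cx, cy = px + dx * pw, py + dy * ph
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal = (10, 10, 50, 50)      # made-up region proposal
ground_truth = (20, 15, 60, 75)  # made-up true box
deltas = encode(proposal, ground_truth)
refined = decode(proposal, deltas)
```

During training, the regression branch learns to output `deltas` while the classification branch learns the label, and both share the same features.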

Faster R-CNN

The Faster R-CNN method works even faster than Fast R-CNN. Fast R-CNN was fast, but it was still limited by the slow selective search step. Faster R-CNN replaces selective search with a Region Proposal Network (RPN), a small convolutional network that generates the regions of interest itself. Along with the RPN, this method uses anchor boxes to handle the multiple aspect ratios and scales of objects in a frame. Faster R-CNN remains one of the more accurate and widely used object detection algorithms.
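Anchor boxes can be pictured as a fixed set of reference boxes placed at each position, covering several scales and aspect ratios. The sketch below generates them for one position. The function name is hypothetical, and the default scales and ratios are the ones reported in the Faster R-CNN paper (three scales times three ratios, nine anchors per position); other detectors use different sets.

```python
import math


def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x0, y0, x1, y1) centered at `center`.

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    cx, cy = center
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors((300, 200))
```

The RPN then scores each anchor as object or background and predicts offsets that refine it, so objects of different shapes can be matched by whichever anchor fits best.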

You Only Look Once (YOLO) Family

The YOLO framework looks at the entire image in a single pass and predicts the bounding boxes together with the class probabilities used to label them. Detectors in the YOLO family are very fast.
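As a rough sketch of how a single-pass prediction becomes a labeled box, the code below decodes one grid-cell prediction using the layout described in the first YOLO paper: box center relative to the cell, box size relative to the image, and a class score equal to box confidence times class probability. The function name, the 7x7 grid on a 448-pixel image, and the numbers are assumptions for illustration. Later YOLO versions use different encodings.

```python
def decode_cell(row, col, pred, class_names, grid_size=7, image_size=448):
    """Turn one cell's raw prediction into (label, score, box in pixels).

    pred = (x, y, w, h, confidence, class_probs), where x and y are offsets
    within the cell and w and h are fractions of the image size.
    """
    x, y, w, h, confidence, class_probs = pred
    cell = image_size / grid_size
    cx, cy = (col + x) * cell, (row + y) * cell
    bw, bh = w * image_size, h * image_size
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    score = confidence * class_probs[best]  # class-specific confidence
    box = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
    return class_names[best], score, box


# Made-up prediction for the center cell of a 7x7 grid.
prediction = (0.5, 0.5, 0.25, 0.5, 0.8, [0.1, 0.9])
label, score, box = decode_cell(3, 3, prediction, ["person", "car"])
```

Every cell is decoded from the same network output, so the whole image is processed in one forward pass rather than region by region.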