Contents
Two-Stage Object Detection Algorithms
One-Stage Object Detection Algorithms
Transformer-based Approaches
Object detection is a key area in computer vision, primarily focused on target localization and target classification. Traditional object detection methods, which combined region selection, manual feature extraction, and a separate classifier, often struggled to handle the diverse features of different objects. This limitation led to mediocre success rates in solving object detection problems.
The advent of deep learning revolutionized the field, enabling neural networks to automatically learn powerful feature extraction and fitting capabilities from large volumes of data. Consequently, numerous high-performance object detection algorithms have emerged. These algorithms can be supervised, semi-supervised, or unsupervised, depending on the volume of labeled data. However, supervised learning algorithms remain the most common, highlighting the critical role of high-quality data annotation for optimal learning results. This calls for a well-trained team to accurately define the region by drawing precise bounding boxes. Based on deep learning, object detection methods are typically divided into three categories: two-stage object detection, one-stage object detection, and transformer-based object detection. This article provides an overview of these three methods.
Two-Stage Object Detection Algorithms
Two-stage object detection algorithms differ from one-stage algorithms in that they first extract candidate regions from an image, then refine the detection results based on these regions. This process heavily relies on high-quality data annotation. These algorithms achieve higher detection accuracy but are slower. The pioneer in this approach is RCNN [3], which was subsequently improved upon by Fast RCNN [4] and Faster RCNN [5]. In particular, Faster RCNN remains a competitive algorithm in object detection due to its strong performance. Subsequent algorithms, such as FPN [6] and Mask RCNN [7], addressed Faster RCNN's limitations, enriching its components and enhancing its performance.
RCNN
RCNN, short for Regions with CNN features, was among the first successful applications of deep learning to object detection. The central idea behind the algorithm is straightforward: for every image, RCNN first employs the selective search algorithm to generate approximately 2000 candidate regions. These regions are resized to a consistent dimension, and their features are extracted using a Convolutional Neural Network (CNN). The regions are then classified by Support Vector Machine (SVM) classifiers, and linear regression models are used to generate more precise bounding boxes for each detected object.
However, despite its substantial advancements, RCNN has two main limitations. Firstly, the full detection process requires three distinct models: a CNN for feature extraction, an SVM for object classification, and a linear regression model for bounding box refinement. As a consequence, RCNN cannot be trained in an end-to-end manner. Instead, each model must be trained separately, which complicates and prolongs the training process. Secondly, extracting 2000 region proposals and computing CNN features for each region means the CNN is run roughly 2000 times per image. This repeated computation significantly reduces the model's inference speed. On average, prediction takes approximately 45 seconds per image, rendering RCNN impractical for large-scale datasets.
Fast RCNN
In the original RCNN model, each candidate region necessitates separate feature extraction using a Convolutional Neural Network (CNN). To mitigate the computational burden, Fast RCNN proposes a more efficient approach: it extracts features for all regions of interest using CNN only once per image. To accomplish this, Fast RCNN operates as follows: Initially, a heuristic algorithm is deployed to generate a substantial number of region proposals. The image is subsequently passed through a CNN to acquire image features. These features, combined with the relative positions of the region proposals, provide the respective region features. Through Region of Interest (ROI) pooling, these regional features are adjusted to a uniform size and then passed through a fully connected neural network. Ultimately, a softmax layer is introduced for object class prediction, and a linear regression layer is employed for bounding box refinement.
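The ROI pooling step can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as nested Python lists and a region given in integer feature-map coordinates; the helper names `ceil_div` and `roi_max_pool` are hypothetical and not taken from any library. Whatever the size of the region, the output grid has the same fixed shape, which is what lets the fully connected layers that follow accept it.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one region of a single-channel feature map to a fixed grid.

    feature_map: list of rows (H x W).
    roi: (x1, y1, x2, y2), inclusive integer coordinates on the feature map.
    """
    x1, y1, x2, y2 = roi
    roi_w = x2 - x1 + 1
    roi_h = y2 - y1 + 1
    pooled = []
    for i in range(output_size):
        # Bin edges are rounded to whole cells, so neighbouring bins may overlap.
        y_start = y1 + (i * roi_h) // output_size
        y_end = y1 + ceil_div((i + 1) * roi_h, output_size)
        row = []
        for j in range(output_size):
            x_start = x1 + (j * roi_w) // output_size
            x_end = x1 + ceil_div((j + 1) * roi_w, output_size)
            cells = [feature_map[y][x]
                     for y in range(y_start, y_end)
                     for x in range(x_start, x_end)]
            row.append(max(cells))
        pooled.append(row)
    return pooled

fmap = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
# Two regions of different sizes both come out as 2 x 2 grids.
print(roi_max_pool(fmap, (1, 0, 5, 2)))  # [[10, 12], [16, 18]]
print(roi_max_pool(fmap, (0, 0, 1, 3)))  # [[7, 8], [19, 20]]
```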
Fast RCNN significantly improves upon the computational efficiency of its predecessor. It can process a single image in merely 2 seconds, a stark contrast to RCNN's 45-second timeframe. However, it continues to rely on the selective search for region proposal generation, which remains a computationally intensive task.
Faster RCNN
Faster RCNN further refines the region proposal generation process by introducing the Region Proposal Network (RPN), optimizing both computational speed and detection accuracy. Specifically, Faster RCNN works as follows: first, the image is run through a CNN to yield high-level feature maps. The RPN then slides a small window over these feature maps and places K anchor boxes of different scales and aspect ratios at each window location.
For each anchor box, the RPN predicts the probability that it contains an object and the regression offsets that adjust the box. This process generates region proposals of various shapes and sizes, whose features are then pooled and passed through fully connected layers to yield the final object class predictions and fine-tuned bounding boxes. Faster RCNN not only enhances computational speed but also achieves superior accuracy. Even today, it remains one of the primary algorithms in the field of object detection.
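As a rough illustration of the anchor mechanism, the sketch below builds the K anchors for one sliding-window location and applies a set of predicted offsets to one of them. The function names `make_anchors` and `decode` are hypothetical, the scale and ratio values are illustrative assumptions, and the offset parameterization (centre shifts scaled by anchor size, log-space width and height) is the commonly used one rather than something stated in this article.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return K = len(scales) * len(ratios) anchors (cx, cy, w, h) at one location."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area s * s fixed while varying the ratio r = h / w.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors

def decode(anchor, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return (xa + tx * wa,        # shift the centre by a fraction of the width
            ya + ty * ha,        # shift the centre by a fraction of the height
            wa * math.exp(tw),   # rescale the width
            ha * math.exp(th))   # rescale the height

anchors = make_anchors(cx=300, cy=200)
print(len(anchors))  # 9 anchors for 3 scales x 3 ratios
# Zero offsets leave the anchor unchanged; non-zero offsets move and resize it.
print(decode((300, 200, 128.0, 128.0), (0.0, 0.0, 0.0, 0.0)))
print(decode((300, 200, 128.0, 128.0), (0.25, -0.5, math.log(2), 0.0)))
```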
FPN
Faster R-CNN typically employs a single high-level feature map, downsampled by a factor of 16, from a Conv4 convolutional layer for object classification and bounding box regression. Unfortunately, this approach falls short when detecting small objects, which inherently have limited pixel information. Downsampling may lead to the loss of this crucial information, resulting in decreased performance. To rectify this, Feature Pyramid Network (FPN) introduces a pyramid network structure that incorporates multi-scale features. This innovation enhances the detection performance for small objects with only a minor increase in computational complexity.
Specifically, Faster R-CNN uses the final layer's features as input to the Region Proposal Network (RPN). These features undergo a 3x3 convolution to obtain a 256-channel feature map, followed by two 1x1 convolutions that produce objectness scores and bounding box regressions. In contrast, FPN feeds the features from P2, P3, P4, P5, and P6 (five feature maps with different downsampling factors) into the RPN. Each feature map corresponds to a unique downsampling factor, capturing different scale information. Anchor boxes with areas of 32x32, 64x64, 128x128, 256x256, and 512x512 pixels correspond to P2, P3, P4, P5, and P6, respectively. This arrangement allows each feature map to handle a specific scale, reducing the burden on each map. Moreover, the authors of FPN found that sharing RPN parameters across these five feature maps yields results nearly identical to those achieved without sharing parameters. This finding suggests that features from different downsampling factors carry similar semantic information. Following its introduction, FPN quickly became a critical component of Faster R-CNN.
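The multi-scale maps that FPN passes to the RPN are built by a top-down pathway: a coarser map is upsampled and added to the finer map from the backbone. The sketch below shows only that merge step on single-channel maps held as nested lists. It is a simplification under stated assumptions: the 1x1 lateral convolution and the 3x3 smoothing convolution of the real network are omitted, and the function names `upsample2x` and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2-D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out

def top_down_merge(coarse, lateral):
    """Add the upsampled coarser map to the same-sized finer (lateral) map."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]

coarse = [[1, 2]]                        # e.g. a 1 x 2 map from a deeper level
lateral = [[10, 10, 10, 10],
           [20, 20, 20, 20]]             # the 2 x 4 map one level below
print(top_down_merge(coarse, lateral))   # [[11, 11, 12, 12], [21, 21, 22, 22]]
```

Repeating this merge from the coarsest level downwards gives every level access to high-level semantics while keeping its own spatial resolution.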
Mask RCNN
Building on Faster R-CNN, Mask RCNN strengthens the feature extraction network and the RPN by using the combination of ResNeXt-101 and FPN as a backbone, and it adds an extra branch for mask prediction. Furthermore, Mask RCNN replaces the Region of Interest (ROI) pooling layer used in Faster RCNN with the ROIAlign layer. In Faster RCNN, two quantization (rounding) steps occur: (1) The proposal coordinates output by the RPN, which are generally fractional once mapped onto the feature map, are rounded to integers. (2) In the ROI pooling layer, the rounded regions are uniformly divided into KxK cells, which requires rounding the cell boundaries as well.
The authors of Mask RCNN argue that these two rounding steps cause misalignment between the pooled features and the original RPN proposals, affecting the model's accuracy. To mitigate this issue, they propose the ROIAlign method, which avoids rounding and uses bilinear interpolation to obtain feature values at floating-point coordinates.
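The bilinear interpolation at the heart of ROIAlign can be sketched in a few lines. The example assumes a single-channel feature map stored as nested lists and a sampling point that lies inside the map; `bilinear_sample` is a hypothetical helper name. A full ROIAlign layer would sample several such points per cell and pool them, which is left out here.

```python
import math

def bilinear_sample(fmap, x, y):
    """Read a feature map at a floating-point location (x, y).

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    height, width = len(fmap), len(fmap[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0, 10],
        [20, 30]]
print(bilinear_sample(fmap, 0.5, 0.5))   # 15.0, the average of all four cells
print(bilinear_sample(fmap, 0.25, 0.0))  # 2.5, a quarter of the way from 0 to 10
```

Because the sampling location is never rounded, a small shift in the proposal produces a small change in the pooled value instead of a jump to a neighbouring cell.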
One-Stage Object Detection Algorithms
One-stage object detection algorithms perform detection directly on the input image, trading off a degree of accuracy for increased speed. Because detection happens in a single step, a well-annotated dataset is especially important for balancing speed and accuracy. This approach was initially championed by YOLO [8] and was followed by SSD [9] and RetinaNet [10]. YOLO itself was refined through a series of successors, YOLOv2 through YOLOv5; the original authors produced YOLOv2 and YOLOv3, while YOLOv4 and YOLOv5 came from other developers. While the prediction accuracy of one-stage models may not rival that of two-stage object detection algorithms, YOLO's faster computation speed and real-time performance have led to its widespread industry adoption.
YOLO
Two-stage object detection involves numerous predefined anchor boxes and a dedicated RPN for position refinement. This complexity and computational expense motivated the creation of YOLO, a one-stage object detection algorithm. YOLO processes an image once through a neural network to simultaneously predict object positions and classes, transforming the object detection task into an end-to-end regression problem and hence increasing computational efficiency.
Here's a typical YOLO workflow:
Resize the input image to a standard size and overlay a grid on it.
Extract image features using a convolutional neural network and use the grid cells to perform regression predictions with fully connected layers. Each grid cell predicts K boxes, each with five regression values. Four of these values represent the box's position, and the fifth indicates the confidence of the box containing an object and the accuracy of its position.
For cells containing objects, use a fully connected layer to predict the conditional probabilities of the object's classes.
Thus, the network outputs a total of N x (K x 5 + C) predictions, where N is the number of grid cells, K is the number of boxes predicted per cell, and C is the number of classes. Despite the drastic speed improvements offered by one-stage models, YOLO's coarse grid division can hinder its performance in detecting small objects. As a result, its overall accuracy falls short of that of Faster R-CNN.
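To make the output size concrete, the short sketch below evaluates N x (K x 5 + C) for one configuration. The values used (a 7 x 7 grid, 2 boxes per cell, 20 classes) are the ones reported for Pascal VOC in the original YOLO paper; they are given here as an illustrative assumption, and `yolo_output_size` is a hypothetical helper name.

```python
def yolo_output_size(grid_size, boxes_per_cell, num_classes):
    """Number of values predicted for one image: N x (K x 5 + C)."""
    num_cells = grid_size * grid_size              # N grid cells
    per_cell = boxes_per_cell * 5 + num_classes    # K boxes x 5 values, plus C class scores
    return num_cells * per_cell

print(yolo_output_size(grid_size=7, boxes_per_cell=2, num_classes=20))  # 1470
```

Each cell therefore contributes 30 numbers: two boxes of five values each and one shared set of 20 class probabilities.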
SSD
To overcome YOLO's limitations, SSD incorporates strategies inspired by two-stage object detection algorithms. This approach enhances accuracy while preserving a high computation speed. Unlike YOLO, which uses a single high-level feature map for detection, SSD utilizes multiple feature maps from different layers of the feature extraction network to improve small object detection.
In YOLO, bounding box predictions are relative to the square grid cells, often leading to significant shape discrepancies compared to the actual objects. SSD borrows from the concept of anchor boxes in Faster R-CNN, assigning different scales and aspect ratios to anchor boxes for each grid cell to alleviate training difficulty. Additionally, SSD replaces fully connected layers with convolutional layers and performs regression predictions for different feature maps, reducing the model's parameter size and increasing computational speed. SSD's enhancements result in performance comparable to or better than Faster R-CNN. However, it has been surpassed by subsequent algorithms based on Faster R-CNN.
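A hedged sketch of how anchor (default) boxes with different scales and aspect ratios might be laid out across SSD's feature maps follows. It assumes scales spaced linearly between a minimum and a maximum fraction of the image size, with 0.2 and 0.9 as the endpoints, and a box shape of width = scale x sqrt(ratio) and height = scale / sqrt(ratio). The function names and the particular aspect ratios are illustrative choices, not a definitive reproduction of the SSD configuration.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """One scale per feature map, spaced evenly from s_min to s_max."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) pairs, as fractions of the image size, sharing one area."""
    return [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]

scales = default_box_scales(num_maps=6)
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
for w, h in default_box_shapes(scales[0]):
    print(round(w, 3), round(h, 3))   # square, wide, and tall boxes of equal area
```

Small scales are assigned to the early, high-resolution feature maps and large scales to the later, coarse ones, which is how the multi-map design helps with small objects.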
RetinaNet
RetinaNet is a single-stage detector that uses Focal Loss to handle the issue of class imbalance, which is a key challenge in object detection tasks. The algorithm does not introduce new model layers, differentiating it from other models such as SSD. Object detection tasks often suffer from severe class imbalance issues between positive and negative samples. This is due to the fact that these algorithms densely sample each image position, and since the number of objects is limited, the number of candidate regions containing objects is significantly less than that of the background regions. Two-stage object detection models, such as Faster R-CNN, partially mitigate this issue through the use of the Region Proposal Network (RPN). The RPN reduces the number of negative samples, allowing the final detection module to process only a small number of candidate boxes.
One-stage object detection models such as SSD attempt to address class imbalance by employing hard negative mining. This process involves selecting the top-k negative samples with the highest loss from a large pool of negative samples, with the aim of maintaining a positive-to-negative ratio of about 1:3. RetinaNet, on the other hand, uses a novel loss function called Focal Loss to handle the class imbalance problem. Focal Loss dynamically scales the cross-entropy loss based on the predicted confidence. As the confidence of a correctly predicted sample increases, the weight coefficient of its loss gradually decreases towards zero. This approach focuses training on difficult examples while minimizing the contribution of easy, correctly classified examples. RetinaNet also places a Feature Pyramid Network (FPN) on top of its backbone to detect objects at different scales, and attaches a classification subnet and a box regression subnet to each pyramid level. This makes RetinaNet an effective tool for object detection tasks involving various object sizes.
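The down-weighting behaviour of Focal Loss can be seen in a small sketch for the binary case. It assumes the usual form, in which the cross-entropy term is multiplied by (1 - p_t) raised to a focusing parameter gamma and by a class-balancing weight alpha; the defaults gamma = 2 and alpha = 0.25 are the values commonly quoted from the RetinaNet paper and are treated here as assumptions.

```python
import math

def cross_entropy(p, target):
    """Binary cross-entropy for one sample; p is the predicted foreground probability."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if target == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) is scaled down far more than a hard one (p = 0.1).
for p in (0.9, 0.5, 0.1):
    ratio = focal_loss(p, 1) / cross_entropy(p, 1)
    print(p, round(ratio, 4))  # ratios: 0.0025, 0.0625, 0.2025
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which makes the role of gamma easy to check.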
YOLO's Subsequent Versions
The YOLO algorithm received various optimizations over time, culminating in four upgraded versions: YOLOv2 through YOLOv5, the last two of which were developed outside the original YOLO team. YOLOv2 introduced several enhancements, including Batch Normalization, a high-resolution classifier, the use of anchor boxes, k-means clustering to choose anchor box dimensions, constraints on predicted bounding box positions, passthrough layers for detecting fine-grained features, and hierarchical classification. YOLOv3 integrated ideas from SSD, utilizing multi-scale features for object detection. Unlike SSD, however, YOLOv3 employs upsampling and feature fusion to combine features at different scales. Some later versions of YOLO also experimented with the Focal Loss technique proposed by RetinaNet to address the class imbalance issue.
Transformer-based Approaches
The relationships between objects can potentially enhance detection accuracy in object detection tasks. However, before the advent of Transformer models [11], neither one-stage nor two-stage object detection algorithms effectively utilized attention mechanisms to capture object relationships. One potential reason is the challenge of modeling the relationships between objects, as the positions, scales, categories, and quantities of objects vary across images. Most modern CNN-based methods have a simple, regular network structure that struggles to handle these complex phenomena. To address this, Relation Net [12] and DETR [13] introduced Transformers to model the relationships between different objects and incorporate attention mechanisms into object detection.
Relation Net
Despite using Transformer-style attention, Relation Net is still an enhancement of Faster RCNN. Faster RCNN generates a series of region proposals using the RPN, which are then fed into a neural network to predict object positions and classes. Instead of directly inputting the region proposals into the final prediction network, Relation Net first passes these proposals through an attention module. The attention mechanism fuses the relationship information between different region proposals, enhancing their features. The module's output is then passed to the final prediction network for predicting object positions and classes. Furthermore, an attention-based module replaces the non-maximum suppression (NMS) step, making end-to-end training possible.
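The idea of fusing relationship information between proposals can be sketched with plain scaled dot-product self-attention over a handful of proposal feature vectors. This is a deliberately reduced illustration under stated assumptions: it has no learned projection matrices and ignores the geometric (box position) term that a relation module would also use, and the names `softmax` and `relate` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def relate(features):
    """Add to each proposal feature an attention-weighted sum of all proposal features."""
    d = len(features[0])
    enhanced = []
    for query in features:
        # Similarity of this proposal to every proposal, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in features]
        weights = softmax(scores)
        context = [sum(w * f[j] for w, f in zip(weights, features))
                   for j in range(d)]
        enhanced.append([q + c for q, c in zip(query, context)])
    return enhanced

proposals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # three toy proposal features
for row in relate(proposals):
    print([round(v, 3) for v in row])
```

The output keeps one vector per proposal with the same dimensionality, so it can be handed to the existing classification and regression heads unchanged.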
DETR
DETR (DEtection TRansformer) represents a significant work that employs Transformers for object detection tasks.
DETR first uses a CNN to extract features from the image, flattens the feature map into a sequence, and adds position encoding, a technique commonly used in natural language processing (NLP).
In the encoder stage, this sequence is fed into the encoder, which extracts features using attention mechanisms.
In the decoder stage, N learned object queries are fed in, with each query attending to different positions in the image. After the attention mechanism, the decoder outputs N bounding box predictions and class predictions.
Finally, during training, a Hungarian matching algorithm assigns the predicted boxes one-to-one to the ground truth boxes so that the loss can be computed.
DETR completely abandons the use of anchor boxes, region proposals, and NMS modules. Instead, it treats object detection as a direct set prediction problem, eliminating the need for post-processing steps such as anchor adjustment and non-maximum suppression.
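The one-to-one assignment behind set prediction can be illustrated on a toy example. The sketch below is not the Hungarian algorithm itself: it swaps in a brute-force search over all assignments, which finds the same minimum-cost matching but is only practical for a few boxes. The matching cost used here (negative predicted probability of the true class plus the L1 distance between boxes) is a simplified assumption, and the function names are hypothetical.

```python
from itertools import permutations

def match_cost(pred, gt):
    """Cost of pairing one prediction with one ground-truth object."""
    class_probs, box = pred
    gt_class, gt_box = gt
    l1 = sum(abs(a - b) for a, b in zip(box, gt_box))
    return -class_probs[gt_class] + l1

def best_assignment(preds, gts):
    """For each ground-truth object, return the index of its matched prediction."""
    best, best_cost = None, float("inf")
    # Brute force stand-in for the Hungarian algorithm: try every one-to-one pairing.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

# Each prediction: (class probabilities, box as (cx, cy, w, h) in [0, 1]).
preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
# Each ground-truth object: (class index, box).
gts = [
    (0, (0.2, 0.2, 0.1, 0.1)),
    (1, (0.5, 0.5, 0.2, 0.2)),
]
print(best_assignment(preds, gts))  # (1, 0)
```

Here the first ground-truth object is matched to prediction 1 and the second to prediction 0; prediction 2 is left unmatched and would be trained to predict "no object". Because every object is claimed by exactly one prediction, duplicate detections are discouraged during training rather than filtered afterwards.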
Conclusion
The field of object detection has seen significant advancements in recent years, largely due to progress in deep learning. Two-stage object detection algorithms such as Faster RCNN, FPN, and Mask RCNN deliver high accuracy but tend to be relatively slower. Conversely, one-stage object detection algorithms like YOLO, SSD, and RetinaNet prioritize speed but may sacrifice some accuracy. Transformer-based approaches like Relation Net and DETR employ attention mechanisms and model object relationships to enhance object detection performance. The decision of which algorithm to use depends on the specific requirements of the application. If accuracy is the top priority and real-time performance is not critical, two-stage object detection algorithms may be suitable. On the other hand, if real-time performance is crucial and a minor decrease in accuracy is acceptable, one-stage object detection algorithms like YOLO or SSD could be considered. Transformer-based approaches are relatively new and still being explored, but they hold promise for further advancements in object detection.
References
[1] Everingham, M., Eslami, S., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. (2015). "The Pascal Visual Object Classes Challenge: A Retrospective." International Journal of Computer Vision, 111(1), 98–136.
[2] Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, L. (2014). "Microsoft COCO: Common Objects in Context." In European Conference on Computer Vision (pp. 740–755).
[3] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." In Conference on Computer Vision and Pattern Recognition (pp. 580–587).
[4] Girshick, R. (2015). "Fast R-CNN." In International Conference on Computer Vision (pp. 1440–1448).
[5] Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." In Neural Information Processing Systems (pp. 91–99).
[6] Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). "Feature Pyramid Networks for Object Detection." In Conference on Computer Vision and Pattern Recognition.
[7] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). "Mask R-CNN." In International Conference on Computer Vision.
[8] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). "You Only Look Once: Unified, Real-Time Object Detection." In Conference on Computer Vision and Pattern Recognition (pp. 779–788).
[9] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. (2016). "SSD: Single Shot MultiBox Detector." In European Conference on Computer Vision (pp. 21–37).
[10] Lin, T., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). "Focal Loss for Dense Object Detection." In International Conference on Computer Vision.
[11] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." In Neural Information Processing Systems.
[12] Hu, H., Gu, J., Zhang, Z., Dai, J., & Wei, Y. (2018). "Relation Networks for Object Detection." In Conference on Computer Vision and Pattern Recognition.
[13] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). "End-to-End Object Detection with Transformers." In European Conference on Computer Vision.