

YOLO Object Detection Algorithms 101: Part 1





Admon W.

YOLO (You Only Look Once) is one of the first single-stage object detection methods, and it transformed the field by delivering real-time results. The original model ran at 45 frames per second while achieving roughly double the mean Average Precision (mAP) of other real-time detectors of its time, redefining what was possible with deep learning-based object detection.


But what exactly is YOLO, and why has it caused such a stir? What's mAP, and for that matter, what's object detection? How can you leverage YOLO for your own projects?

Hold those thoughts.

In this series, we'll start by exploring object detection, followed by an in-depth look at YOLO architecture. Then, we'll guide you through data preparation and training your very own YOLO model.

Enough talk. Let's get started!

Into Object Detection: Two-Stage and Single-Stage Detectors

What is Object Detection?

Object detection is a computer vision technique for locating objects in an image or a video.

Unlike object classification, which only predicts the class label, or object segmentation, which generates a pixel-wise mask, object detection provides both the bounding box coordinates and the class label for each detected object.

Object Detection vs. Object Classification vs. Object Segmentation

Object Detection Performance Evaluation Metrics

To assess the performance of object detection models, two key metrics are used: Intersection over Union (IoU) for localization accuracy and Mean Average Precision (mAP) for overall detection quality, which reflects both whether objects are found and whether they are labeled correctly.

Intersection over Union (IoU)

IoU is a metric that quantifies the accuracy of object localization by measuring the overlap between the predicted bounding box and the ground truth bounding box. It is calculated as the ratio of the area of intersection to the area of the union of the two bounding boxes, with values ranging from 0 (no overlap) to 1 (perfect overlap). A higher IoU indicates better localization performance.
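The IoU calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function name and the example coordinates are made up for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predicted = (50, 50, 150, 150)       # hypothetical predicted box
ground_truth = (100, 100, 200, 200)  # hypothetical ground truth box
print(f"IoU = {iou(predicted, ground_truth):.3f}")
```

Here the two 100×100 boxes share a 50×50 region, so the IoU is 2500 / 17500, or about 0.143, which would count as a poor localization under most thresholds.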

Confusion Matrix

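The Confusion Matrix is a table that compares the actual classes of objects with the predicted classes. It consists of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). In object detection, a prediction usually counts as a true positive when its class is correct and its IoU with a ground truth box exceeds a chosen threshold (0.5 is a common choice). Precision, TP / (TP + FP), measures the proportion of positive predictions that are correct, while Recall, TP / (TP + FN), measures the proportion of actual positives that are correctly predicted.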
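As a small worked example, Precision and Recall can be computed directly from the confusion matrix counts. The function name and the counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def precision_recall(tp, fp, fn):
    """Compute Precision and Recall from confusion matrix counts."""
    # Precision: of everything the model flagged, how much was right?
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: of everything that was really there, how much was found?
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 spurious ones, 4 missed objects.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```

With these counts, Precision is 8 / 10 and Recall is 8 / 12. True Negatives do not appear in either formula, which is convenient for detection, where "correctly predicting nothing" is hard to count.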


Average Precision (AP)

Average Precision (AP) is a metric that combines Precision and Recall to provide a comprehensive evaluation of the classification performance. It is calculated by plotting the Precision-Recall curve and computing the area under the curve. The Precision-Recall curve shows the trade-off between Precision and Recall at different confidence thresholds. A higher AP indicates better classification performance.
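The area under the Precision-Recall curve can be computed in several ways. The sketch below uses one common convention (rank detections by confidence, make the precision curve monotonically non-increasing, then sum the area); benchmarks differ in the exact interpolation they use, so treat this as an illustration rather than a reference implementation. The input format, a list of (confidence, is_true_positive) pairs, is an assumption of this example.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of ground truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Walk down the detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Smooth the curve: precision at each point becomes the best precision
    # achievable at that recall or higher.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical detections for one class with 4 ground truth objects.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.5, False)]
print(f"AP = {average_precision(detections, num_ground_truth=4):.3f}")
```

In this example one of the four objects is never detected, so recall tops out at 0.75 and the AP comes to 0.625.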

Mean Average Precision (mAP)

When dealing with multiple object classes, the Mean Average Precision (mAP) is used. mAP is the average of the AP values across all object classes. It provides an overall measure of the object detection model's performance in correctly classifying objects across different categories.
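In code, mAP is simply the mean of the per-class AP values. The class names and AP numbers below are invented for illustration.

```python
def mean_average_precision(ap_per_class):
    """Average the per-class AP values; ap_per_class maps class name -> AP."""
    if not ap_per_class:
        return 0.0
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values.
ap_per_class = {"car": 0.80, "person": 0.70, "bicycle": 0.60}
print(f"mAP = {mean_average_precision(ap_per_class):.2f}")
```

Because every class carries equal weight, a rare class with a poor AP pulls the mAP down just as much as a common one.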

Two Major Architectures for Object Detection Algorithms

Object detection methods can be categorized into two-stage and single-stage detectors based on how the detection pipeline is structured, a choice that also largely determines their speed.

Two-stage object detection typically involves a two-step process: first, proposal generation and background elimination, followed by proposal classification and bounding box regression.

On the other hand, single-stage object detection merges these two processes into one, adopting an implementation framework of "anchors + classification refinement."

The main difference between single-stage and two-stage object detection algorithms is whether region proposals are generated in a separate step before classification.

How Object Detection Works with Two-Stage Detectors

Two-stage object detectors, such as the R-CNN family of algorithms, operate in two distinct stages: region proposal generation and object classification with bounding box refinement. Using the R-CNN algorithm as an example, let's examine how these stages work.

Stage 1: Region Proposal Generation

In the first stage, the Selective Search algorithm is employed to generate regional proposals, also known as Regions of Interest (RoIs). Selective Search segments the image into multiple regions based on color, texture, and size similarity, and hierarchically combines similar regions to form larger regions, ultimately producing a set of RoIs.

Stage 2: Object Classification and Bounding Box Refinement

In the second stage, each RoI is resized and passed through a pre-trained Convolutional Neural Network (CNN) to extract features. These features are then fed into Support Vector Machine (SVM) classifiers to determine the presence of objects within each RoI. Additionally, a bounding box regression model is applied to refine the coordinates of the bounding boxes.

R-CNN: Regions with CNN Features

Limitations of Two-Stage Detectors

Two-stage detectors have several advantages, such as high accuracy, especially for larger objects, and flexibility in using different backbone networks and incorporating additional features. However, they also have some disadvantages:

  • Slower speed due to the two-stage process.

  • Complex training involving multiple steps.

  • Difficulty with detecting small objects accurately.

These limitations have led to the development of single-stage object detectors, such as YOLO (You Only Look Once), which simplify the object detection pipeline and improve detection speed while maintaining high accuracy.

YOLO (You Only Look Once) Algorithm: A Game-Changer in Object Detection

What is YOLO?

YOLO, which stands for "You Only Look Once," is an influential real-time, end-to-end object detection algorithm in computer vision. It revolutionized object detection by completing the detection task in a single pass of the network, unlike previous methods that required multiple passes or two-step processes.


How Does YOLO Object Detection Work?

The working process of YOLO can be divided into the following steps:

  1. Divide the input image into an S×S grid. If the center of an object falls within a grid cell, that cell is responsible for detecting that object.

  2. Each grid cell predicts B bounding boxes and their confidence scores. Each bounding box is defined by five parameters: x, y, w, h, and confidence. Here, x and y define the position of the bounding box center relative to the grid cell, w and h define the width and height of the bounding box relative to the entire image, and confidence reflects both how likely the box is to contain an object and how accurate the box prediction is.

  3. Each grid cell also predicts C conditional class probabilities. These probabilities are the likelihood of an object belonging to each class, given that an object is present in the cell.

  4. During testing, YOLO calculates a class-specific score for each bounding box by multiplying the confidence score of the box with the conditional class probabilities of its cell. Threshold filtering and non-maximum suppression are then applied to obtain the final detection results.
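The steps above can be sketched in plain Python. The values S = 7, B = 2, and C = 20 are the settings the original YOLO used for PASCAL VOC, which gives 30 numbers per grid cell; the function names and the 0.5 NMS threshold are assumptions made for this illustration, and the NMS shown handles a single class.

```python
S, B, C = 7, 2, 20                 # grid size, boxes per cell, number of classes
VALUES_PER_CELL = B * 5 + C        # 2 boxes x (x, y, w, h, confidence) + 20 class probs = 30


def decode_box(row, col, x, y, w, h, grid_size=S):
    """Turn a cell-relative center and image-relative size into a normalized corner box."""
    center_x = (col + x) / grid_size
    center_y = (row + y) / grid_size
    return (center_x - w / 2, center_y - h / 2, center_x + w / 2, center_y + h / 2)


def class_scores(box_confidence, class_probs):
    """Step 4: class-specific score = box confidence x conditional class probability."""
    return [box_confidence * p for p in class_probs]


def iou(a, b):
    """Intersection over Union for (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression over (score, box) pairs of one class."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every lower-scoring box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


# Two overlapping boxes around the same object plus one box elsewhere.
candidates = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (20, 20, 30, 30))]
for score, box in nms(candidates):
    print(score, box)
```

In the example, the 0.8 box overlaps the 0.9 box with an IoU of about 0.68, so it is suppressed, leaving one detection per object.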

YOLO Architecture

Advantages of YOLO: Speed and Accuracy

🚀 Speed: As its name implies, YOLO only requires a single look at the image to detect objects. It applies a single neural network to the full image, which divides the image into regions and predicts bounding boxes and probabilities for each region. This makes YOLO extremely fast, capable of processing 45 frames per second, much faster than other object detection algorithms.

⏰ Real-Time Usage: Because of its speed, YOLO can be used in real-time applications that require immediate response, such as autonomous driving and real-time surveillance.

🔍 Accuracy: Despite its speed, YOLO makes impressive predictions. It outperforms other methods when generalizing from natural images to other domains, like artwork. Small objects, however, remain a weak point of the original model, as noted below.

🧠 Global Context Understanding: Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes and their appearance.

⚠️ Fewer Background Errors: YOLO also produces fewer false positives in the background than R-CNN and Fast R-CNN because it uses global context information to make predictions.

However, it's worth noting that YOLO also has its limitations. For instance, it struggles with small objects that appear in groups and objects with unusual aspect ratios or configurations.


The Evolution of YOLO: From v1 to v9

Since its inception, the YOLO family has undergone a remarkable evolution, with each iteration building upon the previous versions to address limitations and enhance performance.

From the groundbreaking YOLOv1 to YOLOv9, let's dive in and unravel the story of YOLO's development.

A Timeline of YOLO Versions

YOLO: Pioneering Real-time Object Detection

The original YOLO algorithm, YOLOv1, introduced a real-time end-to-end approach for object detection by unifying the detection steps and predicting all bounding boxes simultaneously across a grid laid over the input image. Each bounding box prediction included a confidence score, center coordinates relative to the grid cell, and width and height relative to the whole image, along with class predictions for the cell.

The model was trained on the PASCAL VOC dataset and utilized a loss function composed of localization, confidence, and classification losses.

YOLOv1's strengths included its speed and simplicity, but it had limitations in localization accuracy compared to other methods like Fast R-CNN. The training process involved pre-training the initial layers on ImageNet and fine-tuning on PASCAL VOC datasets with specific augmentations and loss components.

YOLO V2: Improving Upon the Original

YOLOv2, published at CVPR 2017, introduced several improvements to the original YOLO algorithm.

These enhancements included:

  • Batch normalization on all convolutional layers, for improved convergence and regularization.

  • A high-resolution classifier, fine-tuned on ImageNet at a resolution of 448x448 for better performance on higher resolution inputs.

  • Anchor boxes, which helped in predicting more accurate bounding boxes.

  • Direct location prediction, where the network predicted bounding box coordinates relative to the grid cell.

Objects Detected by YOLO9000

Additionally, YOLOv2 removed one pooling layer to obtain finer-grained features and used a fully convolutional architecture.
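To make the anchor box and direct location prediction ideas concrete, here is a minimal sketch of how a YOLOv2-style box is decoded. The network outputs raw values (tx, ty, tw, th); a sigmoid keeps the center inside its grid cell, and the anchor dimensions are scaled exponentially. The function and parameter names are chosen for this example, and all results are in grid cell units.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_yolov2_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Decode raw network outputs into a box center and size, in grid cell units."""
    # The sigmoid bounds the offset to (0, 1), so the center stays inside the cell.
    box_x = cell_x + sigmoid(tx)
    box_y = cell_y + sigmoid(ty)
    # Width and height are multiplicative adjustments of the anchor (prior) size.
    box_w = anchor_w * math.exp(tw)
    box_h = anchor_h * math.exp(th)
    return box_x, box_y, box_w, box_h


# With all-zero outputs the box sits at the cell center and keeps the anchor size.
print(decode_yolov2_box(0.0, 0.0, 0.0, 0.0, cell_x=4, cell_y=6, anchor_w=2.0, anchor_h=3.0))
# → (4.5, 6.5, 2.0, 3.0)
```

Bounding the center offset this way is what makes the prediction "direct": the network cannot push a box arbitrarily far from the cell that is responsible for it, which stabilizes early training.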

YOLO V3: Introducing Darknet-53 and Multi-Scale Predictions

YOLOv3, published in 2018 by Joseph Redmon and Ali Farhadi, introduced significant changes and a larger architecture to achieve state-of-the-art performance while maintaining real-time processing capabilities.

One key improvement in YOLOv3 was the introduction of objectness scores for bounding boxes using logistic regression, allowing for better localization and classification.

Additionally, YOLOv3 utilized binary cross-entropy for class prediction, enabling the assignment of multiple labels to the same box, which is beneficial for datasets with overlapping labels. The network featured a larger feature extractor with 53 convolutional layers (Darknet-53) and residual connections, providing more robust feature extraction capabilities. It also predicted boxes at three different scales, which helps with objects of very different sizes.
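The difference between softmax-style and independent per-class prediction is easiest to see in code. The sketch below scores each class with its own sigmoid and binary cross-entropy term, so a box can carry overlapping labels such as "person" and "woman" at once. The class names, logits, and the 0.5 decision threshold are assumptions for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def binary_cross_entropy(logits, targets, eps=1e-7):
    """Mean binary cross-entropy, treating every class as its own yes/no question."""
    total = 0.0
    for logit, target in zip(logits, targets):
        prob = min(max(sigmoid(logit), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))
    return total / len(logits)


def predicted_labels(class_names, logits, threshold=0.5):
    """Every class whose independent probability clears the threshold is kept."""
    return [name for name, logit in zip(class_names, logits) if sigmoid(logit) > threshold]


class_names = ["person", "woman", "car"]   # hypothetical overlapping label set
logits = [3.0, 2.0, -4.0]                  # hypothetical raw network outputs for one box
targets = [1, 1, 0]

print(predicted_labels(class_names, logits))
print(f"loss = {binary_cross_entropy(logits, targets):.4f}")
```

A softmax would force "person" and "woman" to compete for probability mass; independent sigmoids let both be confidently true.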

YOLO V4: Redefining Accuracy with Anchor Framework

YOLOv4, introduced in 2020, utilized the CSPDarknet53 backbone architecture and achieved an average precision (AP) of 43.5% on the MS COCO dataset. Its follow-up, Scaled-YOLOv4, added a scaled-up model architecture called YOLOv4-large, which included three different sizes (P5, P6, and P7) designed for cloud GPUs. At the time, this architecture exceeded all previous models with a 56% AP on MS COCO.

YOLO V5: Balance of Speed and Accuracy

YOLOv5 is maintained by Ultralytics; as of release v7.0, it includes versions capable of classification and instance segmentation. It is actively developed, with over 250 contributors and frequent improvements, making it easy to use, train, and deploy. YOLOv5x achieved an AP of 50.7% on the MS COCO dataset with an image size of 640 pixels and a speed of 200 FPS on an NVIDIA V100. By increasing the input size to 1536 pixels and using test-time augmentation (TTA), YOLOv5 reached an AP of 55.8%.

YOLOv5 Architecture (Source:

YOLOR: Multi-Task Learning with General Representation

YOLOR, which stands for You Only Learn One Representation, was published on arXiv in May 2021 by the same research team as YOLOv4. It introduces a multi-task learning approach that aims to create a single model for various tasks by learning a general representation and using sub-networks to create task-specific representations. YOLOR achieved an AP of 55.4% on the MS COCO dataset test-dev 2017, showcasing the benefits of introducing implicit knowledge into the neural network for multiple tasks.

YOLOX: Anchor-Free YOLO

YOLOX, published on arXiv in July 2021 by Megvii Technology, introduced several key changes compared to YOLOv3. These changes include an anchor-free architecture inspired by state-of-the-art object detectors, the use of center sampling for multiple positives, a decoupled head separating classification and regression tasks, advanced label assignment inspired by Optimal Transport, and strong augmentations such as MixUp and Mosaic.

YOLO V6: Efficiency Redefined with Enhanced Backbone

YOLOv6 introduced a new backbone built from RepVGG blocks, which rely on re-parameterized convolution, and utilized a decoupled head architecture. Its training recipe also included batch normalization in conv-bn-activation blocks, and it used an exponential moving average of the weights as the final inference model.
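The exponential moving average (EMA) mentioned above is simple to sketch: alongside the weights being trained, a second, smoothed copy is updated after every step, and that smoothed copy is the one used for inference. The decay value and the flat list of weights are simplifications assumed for this example.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the current model weights into the running average."""
    return [decay * ema + (1.0 - decay) * weight
            for ema, weight in zip(ema_weights, model_weights)]


# Hypothetical training loop: the raw weight jumps around, the EMA copy moves smoothly.
ema = [0.0]
for step_weights in ([1.0], [0.0], [1.0], [0.0]):
    ema = ema_update(ema, step_weights, decay=0.9)
    print(f"raw = {step_weights[0]:.1f}, ema = {ema[0]:.4f}")
```

Averaging out the step-to-step noise in this way tends to give a model that generalizes slightly better than the last raw checkpoint.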

YOLO v6 Architecture

YOLO V7: Advanced Backbone Architecture

YOLOv7 was published on arXiv in July 2022 by the same authors as YOLOv4 and YOLOR. YOLOv7 surpassed all known object detectors in speed and accuracy in the range of 5 FPS to 160 FPS. It was trained using only the MS COCO dataset without pre-trained backbones.

Extended Efficient Layer Aggregation Networks

YOLOv7 proposed architecture changes, including an Extended efficient layer aggregation network (E-ELAN) and Model scaling for concatenation-based models. The bag of freebies used in YOLOv7 included planned re-parameterized convolution, coarse label assignment for auxiliary head, batch normalization in conv-bn-activation, implicit knowledge inspired by YOLOR, and exponential moving average as the final inference model.
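"Batch normalization in conv-bn-activation" refers to connecting batch normalization directly to the convolution so that, at inference time, the normalization statistics can be folded into the convolution's weights and bias. The sketch below shows that folding for a single output channel, using a plain weighted sum as a stand-in for the convolution; all names and numbers are illustrative assumptions.

```python
import math


def conv(weights, bias, inputs):
    """Stand-in for one output channel of a convolution: a weighted sum plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def batch_norm(value, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return gamma * (value - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm into the convolution's weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


weights, bias = [0.5, -1.0, 2.0], 0.1           # hypothetical convolution parameters
gamma, beta, mean, var = 1.5, 0.2, 0.3, 4.0     # hypothetical batch norm parameters
inputs = [1.0, 2.0, 3.0]

two_step = batch_norm(conv(weights, bias, inputs), gamma, beta, mean, var)
fused_weights, fused_bias = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
one_step = conv(fused_weights, fused_bias, inputs)
print(f"conv + bn = {two_step:.6f}, fused conv = {one_step:.6f}")
```

Both paths produce the same output, but the fused version needs only one operation per layer, which is why this trick costs nothing in accuracy while speeding up inference.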

YOLO V8: Next-Gen Object Detection for Diverse Applications

YOLOv8 was released in January 2023 by Ultralytics, offering five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large). YOLOv8 supports multiple vision tasks such as object detection, segmentation, pose estimation, tracking, and classification.

YOLO v8 Comparison with Predecessors

The YOLOv8 architecture uses a similar backbone as YOLOv5 with changes in the CSPLayer, now called the C2f module. It employs an anchor-free model with a decoupled head to independently process objectness, classification, and regression tasks, improving overall accuracy. YOLOv8 utilizes CIoU and DFL loss functions for bounding box loss and binary cross-entropy for classification loss, enhancing object detection performance, especially for smaller objects.
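For intuition about the CIoU term, here is a minimal sketch of Complete IoU as it is commonly defined: plain IoU, minus a penalty for the distance between box centers, minus a penalty for mismatched aspect ratios. The loss is then 1 - CIoU. This is an illustration of the formula and not the Ultralytics implementation; the function name and box format (x_min, y_min, x_max, y_max) are assumptions.

```python
import math


def ciou(pred, target, eps=1e-9):
    """Complete IoU between a predicted and a target box."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Squared distance between centers, normalized by the squared diagonal
    # of the smallest box that encloses both.
    center_dist2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
                 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    diagonal2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect ratio consistency term and its weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - center_dist2 / diagonal2 - alpha * v


pred, target = (0, 0, 2, 2), (1, 1, 3, 3)
print(f"CIoU = {ciou(pred, target):.4f}, loss = {1 - ciou(pred, target):.4f}")
```

Unlike plain IoU, the center distance penalty still provides a useful training signal when the boxes do not overlap at all.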

Additionally, YOLOv8 provides an instance segmentation model called YOLOv8-Seg, achieving state-of-the-art results on various benchmarks while maintaining high speed and efficiency.

YOLO V9: What's New and Limitations

The changes in YOLOv9 are relatively minor; it still bases its code architecture on YOLOv5. The main changes can be summarized in two points:

Programmable Gradient Information (PGI): PGI is intended to retain more of the input information as it passes through a deep network, thus addressing the issue of information loss.

The GELAN Architecture

General Efficient Layer Aggregation Network (GELAN): This new lightweight network architecture, GELAN, enables efficient information flow and optimized parameter utilization, thereby reducing the demand for computational resources while maintaining or even enhancing detection accuracy.

Practical Insights on Deploying YOLO Object Detection in Real-World Applications

YOLO object detection holds immense potential for transforming industries with AI. Accurate and real-time detection and localization of key objects form the foundation for autonomous decision-making systems. With its exceptional performance, YOLO has become the go-to choice for many researchers and engineers.

In this section, we will explore 6 industry use cases of YOLO and delve into the thought process behind its practical deployment.

Autonomous Vehicles

Vehicle and Pedestrian Detection

Autonomous driving systems need to detect surrounding vehicles and pedestrians in real-time to avoid collisions. YOLO can be used to detect vehicles and pedestrians in images captured by onboard cameras. Combined with distance estimation and motion prediction, safe driving paths can be planned. Training data should cover various road scenarios, weather conditions, and lighting variations. To further improve detection accuracy and speed, enhancements like Feature Pyramid Networks (FPN) can be applied on top of YOLO.


Traffic Sign and Signal Recognition

This can be treated as a multi-class object detection problem. Common traffic signs and signals are defined as different classes, and YOLO is used for detection and classification. Considering the small size of traffic signs in images, appropriate data augmentation methods like image pyramids and multi-scale training should be employed. It's also important to account for variations in traffic signs across different countries and regions.

Surveillance and Security Systems

Intrusion Detection and Change Monitoring in Restricted Areas

YOLO detects human bodies in images and determines if they are within restricted areas. Training can be done using general datasets like COCO or datasets specifically designed for dense crowd scenarios, such as WiderPerson.
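Deciding whether a detected person is inside a restricted area is a geometry check that runs after detection. One possible sketch: take the bottom-center of the person's bounding box as an approximate foot position and test it against a polygon with the ray casting method. The zone coordinates, the foot-point heuristic, and the function names are all assumptions for this example.

```python
def foot_point(box):
    """Approximate where a person stands: bottom-center of (x_min, y_min, x_max, y_max)."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2, y_max)


def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Only edges that straddle the ray's height can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


restricted_zone = [(100, 100), (400, 100), (400, 300), (100, 300)]  # hypothetical zone, in pixels
person_boxes = [(200, 150, 240, 250), (10, 10, 50, 90)]             # hypothetical detections

for box in person_boxes:
    print(box, "intrusion" if point_in_polygon(foot_point(box), restricted_zone) else "ok")
```

Using the foot point rather than the box center reduces false alarms from people who are standing outside the zone but whose upper body overlaps it in the camera view.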

Beyond intrusion detection, analyzing trends in the number of people within the area is crucial. For example, monitoring changes from 1 person to 0, then to 1, 2, and finally back to 0. Controlling the time logic is important to avoid continuous alarms.
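The time logic can be sketched as a small state machine: raise an alarm when the person count goes up, but stay quiet if an alarm was already raised recently. The class name, the 30-second cooldown, and the per-frame (count, timestamp) interface are assumptions made for this illustration.

```python
class OccupancyAlarm:
    """Raise an alarm when people enter, but not more often than the cooldown allows."""

    def __init__(self, cooldown_seconds=30.0):
        self.cooldown_seconds = cooldown_seconds
        self.previous_count = 0
        self.last_alarm_time = None

    def update(self, person_count, timestamp):
        """Feed the latest count; returns True if an alarm should fire now."""
        someone_entered = person_count > self.previous_count
        self.previous_count = person_count
        if not someone_entered:
            return False
        in_cooldown = (self.last_alarm_time is not None
                       and timestamp - self.last_alarm_time < self.cooldown_seconds)
        if in_cooldown:
            return False
        self.last_alarm_time = timestamp
        return True


# The sequence from the text: 1 person, then 0, then 1, 2, and back to 0.
alarm = OccupancyAlarm(cooldown_seconds=30.0)
for timestamp, count in [(0, 1), (10, 0), (20, 1), (40, 2), (50, 0)]:
    print(f"t={timestamp:>2}s count={count} alarm={alarm.update(count, timestamp)}")
```

In this run the alarm fires at t=0 and t=40; the re-entry at t=20 falls inside the cooldown window and is suppressed. A production system would likely also smooth the count over several frames so that a single missed detection does not register as someone leaving and re-entering.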

Fire Detection

A Gaussian mixture background modeling method is used to distinguish static backgrounds from dynamic foregrounds, identifying potential fire regions. YOLO is then employed for smoke and fire detection, eliminating false positives caused by printed flame images, vehicle lights, sunsets, etc.


Compared to traditional smoke alarms, visual detection can spot small flames earlier, preventing accidents before they occur. In the future, multi-spectral sensors can be incorporated to further improve detection accuracy.

Fall, Climbing, and Fighting Detection

Abnormal behaviors like falling, climbing, crossing fences, and fighting can be detected based on changes in human posture. The basic workflow involves human body detection with YOLO, background separation, and then using classification algorithms to determine posture. For a balance between speed and accuracy, pose estimation models like YOLO_Pose can be utilized.


Safety Helmet / Hard Hat Detection

Monitoring whether workers are wearing safety helmets correctly is crucial in construction sites and similar scenarios. The common approach is to use YOLO for human body or head detection, followed by classification algorithms like ResNet50 to determine if a helmet is worn.


Using YOLO alone can lead to false alarms, so it's important to train the model using a combination of open-source datasets and on-site images. Additionally, factors such as the model's generalization ability and lighting variations can impact detection performance.

Skin Exposure and Liquid Detection

YOLO is used to detect human bodies, and then background segmentation algorithms extract the human regions. By counting the number of exposed skin pixels below the head and comparing it to a threshold, an alarm can be triggered. Similarly, anomalies like liquid leaks and water accumulation can be detected.

Smart Agriculture

Crop Disease and Pest Detection

Crop diseases and pests significantly impact yield and quality. Traditional detection methods rely on manual inspections, which are inefficient.


With YOLO, diseases and pests can be automatically detected in images captured by drones or fixed cameras, and their severity can be estimated. Common types of diseases and pests include leaf spots, rust, aphids, etc. Training data should cover crops at different growth stages and various disease and pest symptoms. Since lesions and insect bodies occupy small areas in images, appropriate data augmentation and small object detection strategies are necessary.

Agricultural Product Sorting and Quality Inspection

Harvested agricultural products require sorting and quality inspection to improve their commercial value. Using YOLO, real-time detection and classification can be performed on conveyor belts to remove substandard products automatically. Common agricultural product defects include appearance damage, irregular shapes, abnormal colors, etc.

Training data should contain normal and defective samples of various agricultural products. To accommodate the diversity of agricultural product morphology, instance segmentation and other techniques can be introduced on top of YOLO. Additionally, real-time requirements need to be considered when selecting appropriate model sizes and inference devices.


Industrial Quality Inspection

Industrial defect detection is complex and task-specific. Two common approaches exist:

Traditional image segmentation + classification, suitable for scenarios with simple backgrounds or distinct foreground-background contrast.

End-to-end object detection or instance segmentation, suitable for complex backgrounds. After detecting suspected defects, further identification and localization are performed through classification, segmentation, morphological analysis, etc. The priority is to ensure a high detection rate and minimize missed detections, followed by efforts to reduce false alarms.


Object Detection for Robot Grasping

Precisely grasping target objects is a fundamental capability for robots. YOLO can be used to detect the objects to be grasped, and then depth information is utilized to estimate their 3D positions, enabling the planning of the robot's motion trajectory. Training data should include various types of target objects, as well as different backgrounds and angles.
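The step from a 2D detection to a 3D position can be sketched with the standard pinhole camera model: take the center pixel of the detected box, read the depth at that pixel, and back-project using the camera intrinsics. The intrinsic values (fx, fy, cx, cy), the depth reading, and the function names below are hypothetical; real values come from camera calibration and the depth sensor.

```python
def box_center(box):
    """Center pixel of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth into camera coordinates, in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 depth camera.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

detected_box = (400, 200, 480, 280)   # hypothetical YOLO detection, in pixels
depth_at_center = 0.75                # hypothetical depth reading, in meters

u, v = box_center(detected_box)
print(pixel_to_camera_xyz(u, v, depth_at_center, fx, fy, cx, cy))
```

The result is expressed in the camera's coordinate frame; a further fixed transform (from hand-eye calibration) would be needed to express it in the robot's base frame before planning a trajectory.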


To enhance grasping stability, further detection of the object's pose and contact points can be incorporated. In practical applications, a balance between detection speed and accuracy is necessary, leading to the selection of appropriate YOLO model variants.

Obstacle Detection for Robot Navigation

Indoor and outdoor environments often contain various obstacles, and robots need to plan safe navigation paths in real time. Using YOLO, obstacles can be detected in the robot's sensor data (e.g., LiDAR, depth cameras), and decisions can be made based on their categories and locations. Common obstacles include furniture, walls, pillars, pedestrians, etc. As robots face diverse environments, training data should cover different scenarios as much as possible. Additionally, sensor characteristics like noise and occlusion should be considered.

Applying YOLO to real-world scenarios involves considering factors such as data quality, algorithm selection, and compound task design. Accumulating domain knowledge and engineering experience is crucial. As object detection technologies like YOLO continue to evolve, we believe more innovative applications will emerge, bringing benefits to public safety and industrial production.

Key Takeaways

  1. Object detection algorithms can be categorized into two main architectures: two-stage detectors (e.g., R-CNN family) and single-stage detectors (e.g., YOLO), with the latter offering faster detection speeds.

  2. YOLO (You Only Look Once) is a breakthrough real-time object detection algorithm that processes images in a single pass, offering impressive speed and accuracy compared to previous multi-stage approaches.

  3. YOLO has undergone significant improvements from v1 to v9, incorporating advanced features and architectures to enhance detection accuracy and efficiency.

  4. Successful real-world deployment of YOLO requires careful consideration of data quality, algorithm selection, task design, and balancing speed and accuracy.

Stay tuned for our next post, where we'll walk you through preparing annotated data for training your YOLO model. We'll use a turtle detection project as a practical example to guide you every step of the way. Click the button below to dive right in!

Read Next: Data Annotation for YOLO
