Traditional vending machines run on simple mechanics. A customer presses a physical button tied to a fixed slot. A motor pushes the product forward into the pickup bay.
These machines offer limited selection. The shopping experience is constrained by the physical layout of the mechanical system.
Computer vision enables a grab-and-go flow. Customers browse shelves freely and pick items just like in a store. The system calculates charges in real time and completes the transaction automatically when the customer closes the door.
In this post, we'll cover the computer vision methods behind this process, along with the data and annotation required to train these systems.

What is a smart vending machine?
A smart vending machine is an automated retail unit that operates without staff. Customers complete purchases on their own. Unlike classic vending machines, it combines IoT connectivity, edge AI vision, and mobile payment.
The smart cabinet is a typical example. Customers verify their identity first. The door unlocks. They open it, take products from the shelves (or pick something up and put it back), and close the door. The system then identifies what was taken and charges the customer's account.
In practice, there are three main technical routes:
Shelf weight sensors detect product removal by measuring weight changes.
RFID tags attached to each item are scanned when the door closes.
Vision-based recognition uses cameras inside the cabinet combined with AI algorithms to infer what the consumer took.
Here, let's focus on the vision-based approach. It avoids the mechanical failure modes of weight sensors and the per-item cost of electronic tags, which are also easily damaged.
How does vending machine vision detect which products were taken?
The industry typically uses two approaches: static vision and dynamic vision.
Static vision compares shelf images before and after the customer opens the door to identify which products are no longer visible. This often requires one camera per shelf for reliable coverage.
In real deployments, subtle lighting changes or product displacement can trigger false positives. Static methods also struggle to capture intent changes in-session, such as picking an item up and placing it back.
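As an illustration, the before/after comparison can be reduced to a per-SKU count diff. A minimal Python sketch, assuming an upstream detector (not shown) has already returned SKU labels for each snapshot; the SKU names are hypothetical:

```python
from collections import Counter

def shelf_diff(skus_before, skus_after):
    """Per-SKU count difference between two shelf snapshots.
    Positive values = items taken; negative = items added/returned."""
    before, after = Counter(skus_before), Counter(skus_after)
    return {sku: before[sku] - after[sku]
            for sku in before.keys() | after.keys()
            if before[sku] != after[sku]}

# Example: the detector returned these labels for the two snapshots.
before = ["cola_330ml", "cola_330ml", "water_500ml", "chips_45g"]
after = ["cola_330ml", "water_500ml", "chips_45g", "chips_45g"]
print(shelf_diff(before, after))  # {'cola_330ml': 1, 'chips_45g': -1}
```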
Dynamic vision does not compare static shelf states. It continuously detects and tracks customer interactions, especially hand motion, while the door is open.
Dynamic systems handle occlusion better. Even if a product region is blocked, hand tracking can continue and preserve the interaction timeline.

Most commercial smart vending machines combine static and dynamic vision to maximize robustness and accuracy. When dynamic vision detects a hand removing a product from the shelf, the system also analyzes images from before and after the door closes to confirm the product has actually left the shelf.
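A simplified view of that fusion logic, with hypothetical event and diff structures (a real billing pipeline is considerably more involved):

```python
def confirm_cart(dynamic_events, static_diff):
    """Cross-check hand-tracking events against the before/after shelf diff.
    dynamic_events: list of ('take' | 'return', sku) from the hand tracker.
    static_diff: {sku: net count removed} from the before/after comparison."""
    net = {}
    for action, sku in dynamic_events:
        net[sku] = net.get(sku, 0) + (1 if action == "take" else -1)

    confirmed, flagged = {}, {}
    for sku in set(net) | set(static_diff):
        a, b = net.get(sku, 0), static_diff.get(sku, 0)
        if a == b == 0:
            continue  # picked up and returned: nothing to bill
        (confirmed if a == b else flagged)[sku] = max(a, b)
    return confirmed, flagged  # flagged items go to secondary verification

events = [("take", "cola_330ml"), ("take", "chips_45g"), ("return", "chips_45g")]
print(confirm_cart(events, {"cola_330ml": 1}))  # ({'cola_330ml': 1}, {})
```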
What computer vision models are used in smart vending machines?
Real-time inference and edge deployment are essential. The system must complete recognition and billing within seconds after the customer closes the door. Models run on edge devices inside the cabinet. Several model types are typically involved.
Object detection models form the foundation of the vision system. They locate products on shelves and generate bounding boxes with SKU labels. Common choices include Faster R-CNN (often with a ResNet50 backbone) for higher accuracy and the YOLO series (such as YOLOv8 and YOLOv9) for faster inference. YOLO models are better suited for edge deployment.
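For illustration, a minimal detection pass with the ultralytics YOLO package; sku_detector.pt stands in for a hypothetical model fine-tuned on the operator's catalog:

```python
from ultralytics import YOLO

model = YOLO("sku_detector.pt")  # hypothetical weights fine-tuned on shelf images

results = model.predict("shelf.jpg", conf=0.5)  # one image, 50% confidence floor
for box in results[0].boxes:
    sku = results[0].names[int(box.cls)]
    print(sku, float(box.conf), box.xyxy[0].tolist())  # label, score, [x1, y1, x2, y2]
```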
Hand detection + action recognition models track hand position and pose in real time. They identify interactions such as reaching, grasping, and returning items. More advanced systems use hand skeleton detection (often 21 key points). This turns object detection into active behavior understanding.
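As a concrete example, MediaPipe Hands exposes exactly this 21-point hand skeleton; a minimal sketch for a single video frame:

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,      # video mode: track landmarks across frames
    max_num_hands=2,
    min_detection_confidence=0.5,
)

frame = cv2.imread("frame.jpg")  # in production, frames stream from the cabinet camera
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:  # 21 landmarks per detected hand
        wrist = hand.landmark[mp.solutions.hands.HandLandmark.WRIST]
        print(f"wrist at ({wrist.x:.2f}, {wrist.y:.2f})")  # normalized coordinates
```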
Instance segmentation models, like Mask R-CNN, provide pixel-level object masks rather than simple bounding boxes. They help when items overlap or have irregular shapes. Due to higher computational cost, segmentation is typically reserved for the verification stage.
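A minimal torchvision sketch; the COCO-pretrained weights shown here would need fine-tuning on SKU-level data before being useful inside a cabinet:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()  # COCO-pretrained; fine-tune for SKUs

img = read_image("shelf.jpg").float() / 255.0  # CHW float tensor in [0, 1]
with torch.no_grad():
    out = model([img])[0]

# Per-instance pixel masks disambiguate overlapping products.
masks = out["masks"][out["scores"] > 0.7]  # (N, 1, H, W) soft masks above a score floor
print(masks.shape)
```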
Image classification models serve as confirmation after object detection. They answer what exactly one product is. This is critical for distinguishing visually similar items, such as different flavors of the same beverage.
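Conceptually, the classifier runs on the detector's crops. A minimal sketch, where fine_model is a hypothetical fine-grained classifier trained on the SKU catalog:

```python
from PIL import Image

def classify_crops(image_path, detections, fine_model):
    """Re-classify each detected box with a fine-grained model.
    detections: list of (x1, y1, x2, y2) boxes from the detector.
    fine_model: hypothetical callable returning (sku_label, confidence) for a crop."""
    image = Image.open(image_path)
    refined = []
    for box in detections:
        crop = image.crop(box)  # PIL crop takes (left, upper, right, lower)
        sku, conf = fine_model(crop)
        refined.append((sku, conf, box))
    return refined
```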
What data is needed to train a vending machine vision system?
A production-grade system needs a large, carefully collected image dataset. Operators usually treat these growing datasets as proprietary.
Industry best practice typically calls for around one hundred different images per SKU. For a mid-sized machine stocking 100 to 300 SKUs, that works out to roughly 10k to 30k training images, and real-world variation often pushes that figure higher.
Smart vending machines operate in diverse environments with varying lighting. Data should cover multiple camera angles and illumination conditions to reduce domain shift.
Reflective materials (such as metal cans and transparent bottles) and different packaging types need targeted capture, so the model learns stable features rather than one specific highlight pattern.
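Targeted capture is primary, but photometric augmentation is a common supplement for simulating lighting shifts; a torchvision sketch (geometric augmentations for detection would also require transforming the boxes, which is omitted here):

```python
from torchvision import transforms

# Jitter brightness/contrast so the model doesn't latch onto one highlight pattern.
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3, hue=0.05),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),  # mild blur for focus variation
    transforms.ToTensor(),
])
```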

Training data must also include dense shelf scenes covering realistic situations like product occlusion, overlap, and tilting. Because product catalogs update, operators may need to add thousands of images monthly and retrain models regularly.
What types of annotation are required for retail product recognition?
Bounding Box Annotation
Annotators draw bounding boxes around each product and assign SKU labels. In many pipelines, boxes are expected to align closely with the visible edges.
Intersection over Union (IoU) is the primary quality metric. Production workflows typically enforce minimum IoU thresholds. Detailed guidelines are needed to handle edge cases such as product shadows and partially visible items.
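IoU itself is simple to compute; a reference implementation for axis-aligned [x1, y1, x2, y2] boxes:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A QA gate might reject annotations below a threshold (0.9 here is illustrative):
print(iou([0, 0, 10, 10], [1, 1, 10, 10]))  # 0.81 -> would fail a 0.9 gate
```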
Polygon Annotation and Segmentation
Polygon annotation uses connected points to outline irregular product shapes (such as bottles or unusually shaped snacks). Instance segmentation requires pixel-level masks that precisely identify each product.
Segmentation takes three to five times longer than bounding box annotation. Most commercial systems rely primarily on bounding boxes and use segmentation only for specific scenarios, such as distinguishing similar adjacent products.
Hand Keypoint Annotation
Hand detection systems require annotation of 21 anatomical keypoints (wrist, knuckles, fingertips, and others). Annotators need anatomical knowledge. Even minor errors can corrupt the entire pose representation. Large-scale datasets may use automated tools to generate initial annotations, but production datasets still require human review and correction.
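To make the target concrete, a COCO-style keypoint record might look like this (field names and values are illustrative, not any specific platform's schema):

```python
hand_annotation = {
    "image_id": 1042,
    "category": "hand",
    "num_keypoints": 21,
    # COCO convention per keypoint: [x, y, v] with v=2 visible, v=1 occluded, v=0 unlabeled
    "keypoints": [
        [312.5, 440.0, 2],   # 0: wrist
        [330.1, 421.7, 2],   # 1: thumb base (CMC)
        [348.9, 402.3, 1],   # 2: thumb joint (MCP), occluded by a product
        # ... 18 more points covering the remaining finger joints and tips
    ],
}
```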
Classification
Semantic labels can be at the individual SKU level or grouped into broader categories. Maintaining naming consistency across hundreds or thousands of SKUs is challenging. Inconsistencies lead to blurred model decision boundaries. Professional platforms use hierarchical classification systems (category → subcategory → specific product) and track annotation consistency.
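A hierarchical label space can be as simple as a nested mapping; the SKU names below are hypothetical:

```python
taxonomy = {
    "beverage": {
        "soda": ["cola_330ml_regular", "cola_330ml_zero", "lemon_soda_330ml"],
        "water": ["spring_water_500ml", "sparkling_water_500ml"],
    },
    "snack": {
        "chips": ["potato_chips_original_45g", "potato_chips_bbq_45g"],
        "chocolate": ["milk_chocolate_bar_40g"],
    },
}

# Consistency checks become mechanical: every annotated label must resolve to a leaf.
valid_skus = {sku for subcats in taxonomy.values() for skus in subcats.values() for sku in skus}
```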
What datasets are available for training retail product recognition models?
Several public datasets target retail product recognition tasks. They provide labeled benchmark data for researchers and developers to build and evaluate computer vision models.
SKU-110K contains roughly 11,700 densely packed shelf images with over 1.7 million annotated product instances. It is suitable for scenarios where products are tightly packed, touching, or partially overlapping.
GroZi-120 is a classic retail recognition dataset covering 120 grocery products. Its main feature is paired data: standard white-background product images alongside real shelf photos.
HoloSelecta contains 295 images of real vending machines in the Zurich area, covering 109 product categories with labeled bounding boxes.
Take Goods from Shelves (TGFS) is a hierarchical large-scale object detection dataset with 38,000 images, divided into 24 fine-grained categories and 3 coarse-grained categories.
Toward New Retail collected over 30,000 images from unmanned retail containers. It includes 155,153 manually annotated instances, focusing on beverage recognition.
Public datasets help, but most commercial operators cannot reach required accuracy using only public training data.
Operator catalogs contain many SKUs that are missing or underrepresented in public sets. Deployment environments also introduce specific lighting, shelf materials, and background characteristics. When models are deployed in environments different from training conditions, accuracy drops.
In regulated settings such as healthcare vending, companies may need auditable, company-specific datasets, and business constraints may rule out reliance on generic public data.
How do you quickly build a training dataset for vending machine AI vision?
Data Collection
Prioritize the real deployment environment. Capture products across multiple viewpoints, so the model generalizes to any shelf position and camera angle in the cabinet.
A Stanford study emphasizes that even for hand-centric tasks, collecting data from multiple camera positions and viewpoints significantly improves model generalization.

Data Annotation
A single operator may need thousands of newly annotated images monthly to keep up with changing product assortments. Many teams partner with external labeling services.
Crowdsourcing can be cost-effective for simple, objective tasks. But workers unfamiliar with a specific retail catalog may confuse similar products and mislabel SKUs. Anonymity can also make systematic quality issues harder to isolate.
Managed data labeling teams provide higher efficiency and comprehensive quality assurance. BasicAI is one company offering fully managed annotation services, including bounding box annotation, segmentation, and SKU-level labeling, with accuracy guarantees above 99%. For mid-sized datasets, expect professional annotation to take two to four weeks.
For organizations with sufficient in-house annotation staff, internal annotation using smart annotation tools and platforms offers greater control.
The BasicAI data annotation platform provides model-assisted annotation. Pretrained models generate initial annotations, which human annotators then review and correct rather than annotating from scratch.
The platform offers private deployment options for organizations with confidentiality requirements, so proprietary catalogs and images can stay on private infrastructure.
Continuous Improvement
The initial dataset and first training run mark only the beginning of production deployment for a smart vending machine. As new products arrive, environments shift, and monitoring surfaces failure cases, the system needs ongoing iteration.
Vision algorithm engineers may need to add more diverse lighting conditions to training data, increase images of problem products, or retrain models to adapt to detected environmental changes.
FAQs
How do vending machine vision systems handle visually similar products?
A common approach is to first run object detection to locate products, then apply fine-grained classification to analyze subtle features such as label colors and text patterns. Distinguishing similar products requires far more training images per variant than basic detection. For cases with confidence scores near decision boundaries, the system triggers secondary verification or human review.
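That decision-boundary check can be a simple confidence gate; the thresholds below are illustrative:

```python
def route_prediction(sku, confidence, accept_at=0.92, verify_below=0.75):
    """Route a fine-grained prediction based on classifier confidence (illustrative thresholds)."""
    if confidence >= accept_at:
        return ("charge", sku)          # confident: bill directly
    if confidence >= verify_below:
        return ("reverify", sku)        # borderline: run segmentation / a second model
    return ("human_review", sku)        # low confidence: queue for manual audit

print(route_prediction("cola_330ml_zero", 0.81))  # ('reverify', 'cola_330ml_zero')
```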
What else can AI vision systems in vending machines do?
Beyond product recognition, AI vision systems can check brand purity by detecting competitor products on brand-exclusive shelves or verifying compliance with display agreements. They can also generate restocking recommendations based on inventory levels. Some advanced systems analyze customer behavior patterns, such as which products are picked up and returned, providing data to support assortment optimization.
How do smart vending machines handle products that customers pick up and then return?
Dynamic vision continuously tracks hand motion and can distinguish between taking and returning behaviors. The system monitors the complete path of a product from the moment it is picked up until it leaves the shelf area. If the item is returned to the original location (or placed elsewhere), the model detects the reverse action and updates the cart and inventory state. This is critical for preventing false charges.