Traditional vending machines run on simple mechanics. A customer presses a physical button tied to a fixed slot. A motor pushes the product forward into the pickup bay.
These machines offer limited selection. The shopping experience is constrained by the physical layout of the mechanical system.
Computer vision enables a grab-and-go flow. Customers browse shelves freely and pick items just like in a store. The system calculates charges in real time and completes the transaction automatically when the customer closes the door.
In this post, we'll cover the computer vision methods behind this process, along with the data and annotation required to train these systems.

What is a smart vending machine?
A smart vending machine is an automated retail unit that operates without staff. Customers complete purchases on their own. Unlike classic vending machines, it combines IoT connectivity, edge AI vision, and mobile payment.
The smart cabinet is a typical example. Customers verify their identity first. The door unlocks. They open it, take products from the shelves (or pick something up and put it back), and close the door. The system then identifies what was taken and charges the customer's account.
In practice, there are three main technical routes:
Shelf weight sensors detect product removal by measuring weight changes.
RFID tags attached to each item are scanned when the door closes.
Vision-based recognition uses cameras inside the cabinet combined with AI algorithms to infer what the consumer took.
Here, let's focus on the vision-based approach. It avoids the mechanical failure modes of weight sensors and the per-item cost of electronic tags, which are also easily damaged.
How does vending machine vision detect which products were taken?
The industry typically uses two approaches: static vision and dynamic vision.
Static vision compares shelf images before and after the customer opens the door to identify which products are no longer visible. This often requires one camera per shelf for reliable coverage.
In real deployments, subtle lighting changes or product displacement can trigger false positives. Static methods also struggle to capture intent changes in-session, such as picking an item up and placing it back.
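As an illustration, the before/after comparison can be reduced to a per-SKU count diff. A minimal Python sketch, assuming an upstream detector (not shown) has already returned SKU labels for each snapshot; the SKU names are hypothetical:

```python
from collections import Counter

def shelf_diff(skus_before, skus_after):
    """Per-SKU count difference between two shelf snapshots.
    Positive values = items taken; negative = items added/returned."""
    before, after = Counter(skus_before), Counter(skus_after)
    return {sku: before[sku] - after[sku]
            for sku in before.keys() | after.keys()
            if before[sku] != after[sku]}

# Example: the detector returned these labels for the two snapshots.
before = ["cola_330ml", "cola_330ml", "water_500ml", "chips_45g"]
after = ["cola_330ml", "water_500ml", "chips_45g", "chips_45g"]
print(shelf_diff(before, after))  # {'cola_330ml': 1, 'chips_45g': -1}
```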
Dynamic vision does not compare static shelf states. It continuously detects and tracks customer interactions, especially hand motion, while the door is open.
Dynamic systems handle occlusion better. Even if a product region is blocked, hand tracking can continue and preserve the interaction timeline.

Most commercial smart vending machines combine static and dynamic vision to maximize robustness and accuracy. When dynamic vision detects a hand removing a product from the shelf, the system also analyzes images from before and after the door closes to confirm the product has actually left the shelf.
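A simplified view of that fusion logic, with hypothetical event and diff structures (a real billing pipeline is considerably more involved):

```python
def confirm_cart(dynamic_events, static_diff):
    """Cross-check hand-tracking events against the before/after shelf diff.
    dynamic_events: list of ('take' | 'return', sku) from the hand tracker.
    static_diff: {sku: net count removed} from the before/after comparison."""
    net = {}
    for action, sku in dynamic_events:
        net[sku] = net.get(sku, 0) + (1 if action == "take" else -1)

    confirmed, flagged = {}, {}
    for sku in set(net) | set(static_diff):
        a, b = net.get(sku, 0), static_diff.get(sku, 0)
        if a == b == 0:
            continue  # picked up and returned: nothing to bill
        (confirmed if a == b else flagged)[sku] = max(a, b)
    return confirmed, flagged  # flagged items go to secondary verification

events = [("take", "cola_330ml"), ("take", "chips_45g"), ("return", "chips_45g")]
print(confirm_cart(events, {"cola_330ml": 1}))  # ({'cola_330ml': 1}, {})
```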
What computer vision models are used in smart vending machines?
Real-time inference and edge deployment are essential. The system must complete recognition and billing within seconds after the customer closes the door. Models run on edge devices inside the cabinet. Several model types are typically involved.
Object detection models form the foundation of the vision system. They locate products on shelves and generate bounding boxes with SKU labels. Common choices include Faster R-CNN (often with a ResNet50 backbone) for higher accuracy and the YOLO series (such as YOLOv8 and YOLOv9) for faster inference. YOLO models are better suited for edge deployment.
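For illustration, a minimal detection pass with the ultralytics YOLO package; sku_detector.pt stands in for a hypothetical model fine-tuned on the operator's catalog:

```python
from ultralytics import YOLO

model = YOLO("sku_detector.pt")  # hypothetical weights fine-tuned on shelf images

results = model.predict("shelf.jpg", conf=0.5)  # one image, 50% confidence floor
for box in results[0].boxes:
    sku = results[0].names[int(box.cls)]
    print(sku, float(box.conf), box.xyxy[0].tolist())  # label, score, [x1, y1, x2, y2]
```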
Hand detection + action recognition models track hand position and pose in real time. They identify interactions such as reaching, grasping, and returning items. More advanced systems use hand skeleton detection (often 21 key points). This turns object detection into active behavior understanding.
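As a concrete example, MediaPipe Hands exposes exactly this 21-point hand skeleton; a minimal sketch for a single video frame:

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,      # video mode: track landmarks across frames
    max_num_hands=2,
    min_detection_confidence=0.5,
)

frame = cv2.imread("frame.jpg")  # in production, frames stream from the cabinet camera
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:  # 21 landmarks per detected hand
        wrist = hand.landmark[mp.solutions.hands.HandLandmark.WRIST]
        print(f"wrist at ({wrist.x:.2f}, {wrist.y:.2f})")  # normalized coordinates
```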
Instance segmentation models, like Mask R-CNN, provide pixel-level object masks rather than simple bounding boxes. They help when items overlap or have irregular shapes. Due to higher computational cost, segmentation is typically reserved for the verification stage.
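A minimal torchvision sketch; the COCO-pretrained weights shown here would need fine-tuning on SKU-level data before being useful inside a cabinet:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()  # COCO-pretrained; fine-tune for SKUs

img = read_image("shelf.jpg").float() / 255.0  # CHW float tensor in [0, 1]
with torch.no_grad():
    out = model([img])[0]

# Per-instance pixel masks disambiguate overlapping products.
masks = out["masks"][out["scores"] > 0.7]  # (N, 1, H, W) soft masks above a score floor
print(masks.shape)
```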
Image classification models serve as confirmation after object detection. They answer what exactly one product is. This is critical for distinguishing visually similar items, such as different flavors of the same beverage.
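Conceptually, the classifier runs on the detector's crops. A minimal sketch, where fine_model is a hypothetical fine-grained classifier trained on the SKU catalog:

```python
from PIL import Image

def classify_crops(image_path, detections, fine_model):
    """Re-classify each detected box with a fine-grained model.
    detections: list of (x1, y1, x2, y2) boxes from the detector.
    fine_model: hypothetical callable returning (sku_label, confidence) for a crop."""
    image = Image.open(image_path)
    refined = []
    for box in detections:
        crop = image.crop(box)  # PIL crop takes (left, upper, right, lower)
        sku, conf = fine_model(crop)
        refined.append((sku, conf, box))
    return refined
```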
What data is needed to train a vending machine vision system?
A production-grade system needs a large, carefully collected image dataset. Operators usually treat these growing datasets as proprietary.
Industry best practice typically calls for around one hundred different images per SKU. For a mid-sized machine stocking 100 to 300 SKUs, that works out to roughly 10k to 30k training images, and real-world variation often pushes that figure higher.
Smart vending machines operate in diverse environments with varying lighting. Data should cover multiple camera angles and illumination conditions to reduce domain shift.
Reflective materials (such as metal cans and transparent bottles) and different packaging types need targeted capture, so the model learns stable features rather than one specific highlight pattern.
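Targeted capture is primary, but photometric augmentation is a common supplement for simulating lighting shifts; a torchvision sketch (geometric augmentations for detection would also require transforming the boxes, which is omitted here):

```python
from torchvision import transforms

# Jitter brightness/contrast so the model doesn't latch onto one highlight pattern.
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3, hue=0.05),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),  # mild blur for focus variation
    transforms.ToTensor(),
])
```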

Training data must also include dense shelf scenes covering realistic situations like product occlusion, overlap, and tilting. Because product catalogs update, operators may need to add thousands of images monthly and retrain models regularly.
What types of annotation are required for retail product recognition?
Bounding Box Annotation
Annotators draw bounding boxes around each product and assign SKU labels. In many pipelines, boxes are expected to align closely with the visible edges.
Intersection over Union (IoU) is the primary quality metric. Production workflows typically enforce minimum IoU thresholds. Detailed guidelines are needed to handle edge cases such as product shadows and partially visible items.
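IoU itself is simple to compute; a reference implementation for axis-aligned [x1, y1, x2, y2] boxes:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A QA gate might reject annotations below a threshold (0.9 here is illustrative):
print(iou([0, 0, 10, 10], [1, 1, 10, 10]))  # 0.81 -> would fail a 0.9 gate
```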
Polygon Annotation and Segmentation
Polygon annotation uses connected points to outline irregular product shapes (such as bottles or unusually shaped snacks). Instance segmentation requires pixel-level masks that precisely identify each product.
Segmentation takes three to five times longer than bounding box annotation. Most commercial systems rely primarily on bounding boxes and use segmentation only for specific scenarios, such as distinguishing similar adjacent products.
Hand Keypoint Annotation
Hand detection systems require annotation of 21 anatomical keypoints (wrist, knuckles, fingertips, and others). Annotators need anatomical knowledge. Even minor errors can corrupt the entire pose representation. Large-scale datasets may use automated tools to generate initial annotations, but production datasets still require human review and correction.
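To make the target concrete, a COCO-style keypoint record might look like this (field names and values are illustrative, not any specific platform's schema):

```python
hand_annotation = {
    "image_id": 1042,
    "category": "hand",
    "num_keypoints": 21,
    # COCO convention per keypoint: [x, y, v] with v=2 visible, v=1 occluded, v=0 unlabeled
    "keypoints": [
        [312.5, 440.0, 2],   # 0: wrist
        [330.1, 421.7, 2],   # 1: thumb base (CMC)
        [348.9, 402.3, 1],   # 2: thumb joint (MCP), occluded by a product
        # ... 18 more points covering the remaining finger joints and tips
    ],
}
```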
Classification
Semantic labels can be at the individual SKU level or grouped into broader categories. Maintaining naming consistency across hundreds or thousands of SKUs is challenging. Inconsistencies lead to blurred model decision boundaries. Professional platforms use hierarchical classification systems (category → subcategory → specific product) and track annotation consistency.
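A hierarchical label space can be as simple as a nested mapping; the SKU names below are hypothetical:

```python
taxonomy = {
    "beverage": {
        "soda": ["cola_330ml_regular", "cola_330ml_zero", "lemon_soda_330ml"],
        "water": ["spring_water_500ml", "sparkling_water_500ml"],
    },
    "snack": {
        "chips": ["potato_chips_original_45g", "potato_chips_bbq_45g"],
        "chocolate": ["milk_chocolate_bar_40g"],
    },
}

# Consistency checks become mechanical: every annotated label must resolve to a leaf.
valid_skus = {sku for subcats in taxonomy.values() for skus in subcats.values() for sku in skus}
```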
What datasets are available for training retail product recognition models?
Several public datasets target retail product recognition tasks. They provide labeled benchmark data for researchers and developers to build and evaluate computer vision models.
SKU-110K contains roughly 11,700 densely packed shelf images with over 1.7 million annotated product instances. It is suitable for scenarios where products are tightly packed, touching, or partially overlapping.
GroZi-120 is a classic retail recognition dataset covering 120 grocery products. Its main feature is paired data: standard white-background product images alongside real shelf photos.
HoloSelecta contains 295 images of real vending machines in the Zurich area, covering 109 product categories with labeled bounding boxes.
Take Goods from Shelves (TGFS) is a hierarchical large-scale object detection dataset with 38,000 images, divided into 24 fine-grained categories and 3 coarse-grained categories.
Toward New Retail collected over 30,000 images from unmanned retail containers. It includes 155,153 manually annotated instances, focusing on beverage recognition.
Public datasets help, but most commercial operators cannot reach required accuracy using only public training data.
Operator catalogs contain many SKUs that are missing or underrepresented in public sets. Deployment environments also introduce specific lighting, shelf materials, and background characteristics. When models are deployed in environments different from training conditions, accuracy drops.
In regulated settings such as healthcare vending, companies may need auditable, company-specific datasets, and business constraints may rule out reliance on generic public data.
How do you quickly build a training dataset for vending machine AI vision?
Data Collection
Prioritize the real deployment environment. Capture products across multiple viewpoints, so the model generalizes to any shelf position and camera angle in the cabinet.
A Stanford study emphasizes that even for hand-centric tasks, collecting data from multiple camera positions and viewpoints significantly improves model generalization.

Data Annotation
A single operator may need thousands of newly annotated images monthly to keep up with changing product assortments. Many teams partner with external labeling services.
Crowdsourcing can be cost-effective for simple, objective tasks. But workers unfamiliar with a specific retail catalog may confuse similar products and mislabel SKUs. Anonymity can also make systematic quality issues harder to isolate.
Managed data labeling teams provide higher efficiency and comprehensive quality assurance. BasicAI is one company offering fully managed annotation services, including bounding box annotation, segmentation, and SKU-level labeling, with accuracy guarantees above 99%. For mid-sized datasets, expect professional annotation to take two to four weeks.
For organizations with sufficient in-house annotation staff, internal annotation using smart annotation tools and platforms offers greater control.
The BasicAI data annotation platform provides model-assisted annotation. Pretrained models generate initial annotations, which human annotators then review and correct rather than annotating from scratch.
The platform offers private deployment options for organizations with confidentiality requirements, so proprietary catalogs and images can stay on private infrastructure.
Continuous Improvement
The initial dataset and first training run mark only the beginning of production deployment for a smart vending machine. As new products arrive, environments shift, and monitoring surfaces failure cases, the system needs ongoing iteration.
Vision algorithm engineers may need to add more diverse lighting conditions to training data, increase images of problem products, or retrain models to adapt to detected environmental changes.
FAQs
How do vending machine vision systems handle visually similar products?
A common approach is to first run object detection to locate products, then apply fine-grained classification to analyze subtle features such as label colors and text patterns. Distinguishing similar products requires far more training images per variant than basic detection. For cases with confidence scores near decision boundaries, the system triggers secondary verification or human review.
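That decision-boundary check can be a simple confidence gate; the thresholds below are illustrative:

```python
def route_prediction(sku, confidence, accept_at=0.92, verify_below=0.75):
    """Route a fine-grained prediction based on classifier confidence (illustrative thresholds)."""
    if confidence >= accept_at:
        return ("charge", sku)          # confident: bill directly
    if confidence >= verify_below:
        return ("reverify", sku)        # borderline: run segmentation / a second model
    return ("human_review", sku)        # low confidence: queue for manual audit

print(route_prediction("cola_330ml_zero", 0.81))  # ('reverify', 'cola_330ml_zero')
```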
What else can AI vision systems in vending machines do?
Beyond product recognition, AI vision systems can check brand purity by detecting competitor products on brand-exclusive shelves or verifying compliance with display agreements. They can also generate restocking recommendations based on inventory levels. Some advanced systems analyze customer behavior patterns, such as which products are picked up and returned, providing data to support assortment optimization.
How do smart vending machines handle products that customers pick up and then return?
Dynamic vision continuously tracks hand motion and can distinguish between taking and returning behaviors. The system monitors the complete path of a product from the moment it is picked up until it leaves the shelf area. If the item is returned to the original location (or placed elsewhere), the model detects the reverse action and updates the cart and inventory state. This is critical for preventing false charges.