A bakery can carry hundreds of items: croissants, bagels, cakes, and seasonal specials...
Cashiers are expected to remember names, prices, and codes for a catalog that changes all the time. Error rates stay high.
Barcodes seem like a solution. But many baked goods need to be bagged and labeled before checkout. That adds labor and time.
This is the problem that led to BakeryScan, a computer vision system from Japan. It identifies items placed on a checkout tray and totals the bill in about one second. As it processes more transactions and encounters new product variants, its accuracy keeps improving.

Vision-based scanless checkout is spreading fast in the retail industry. AI is removing friction from shopping.
In this blog post, we'll discuss how smart checkout works, and what data and annotations are needed to train these systems.
What is vision-based scanless checkout, and what problems does it solve?
In a standard checkout flow, customers place items on a belt or counter. A cashier (or self-checkout) scans barcodes, then the customer pays. Since barcode scanning arrived in the 1970s, the flow has barely changed.
Wait time affects customer satisfaction and purchase intent. Longer waits mean lower return rates and more abandoned carts.
Frictionless checkout eliminates manual item scanning. That shortens checkout time and reduces labor. In practice, it is usually built with RFID or computer vision.
RFID uses small electronic tags attached to products for wireless identification and tracking. But you can't stick these tags on baked goods, fresh produce, or other bulk items.
Computer vision does not require product modification. Packaged and unpackaged items alike can be identified by appearance alone, and the system integrates with payment and inventory management systems. It can reduce transaction time from 3-5 minutes to as little as 5-30 seconds.
Vision-based scanless checkout takes several forms:
Vision checkout terminal: recognizes items placed in a basket or tray and calculates the total.
AI smart scale: weighs items and combines weight with visual recognition to price goods. This is useful for fresh produce.
Smart cart: tracks what is placed into the cart and shows a running total in real time.
“Just Walk Out” store: deploys extensive camera networks throughout the store, tracking every product interaction and charging customers automatically when they leave. No checkout counter is needed.

How do vision-based smart checkout systems recognize items?
A vision checkout system turns images into a price through a multi-stage neural pipeline. Early systems used hand-crafted color and edge features. Modern systems run primarily on convolutional neural networks and vision transformers.
During image capture, systems use LED arrays to control lighting and suppress shadow interference. The captured image then goes through preprocessing steps such as white balance, resizing, and contrast normalization.
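As a rough illustration of the preprocessing step, here is a minimal sketch of gray-world white balance plus contrast normalization in numpy. The function name and the gray-world heuristic are illustrative choices, not a description of any particular vendor's pipeline; resizing is omitted for brevity.

```python
import numpy as np

def preprocess(image):
    """Gray-world white balance followed by min-max contrast
    normalization. `image` is an HxWx3 uint8 RGB array."""
    img = image.astype(np.float64)

    # Gray-world assumption: scale each channel so its mean matches
    # the global mean, correcting for a colored lighting cast.
    channel_means = img.reshape(-1, 3).mean(axis=0)
    img *= channel_means.mean() / channel_means

    # Stretch the result to the full 0-255 range.
    img = (img - img.min()) / max(img.max() - img.min(), 1e-6) * 255.0
    return np.clip(img, 0, 255).astype(np.uint8)

# A uniform reddish cast: after balancing, all channels end up equal.
cast = np.zeros((2, 2, 3), dtype=np.uint8)
cast[..., 0] = 200
cast[..., 1] = 100
cast[..., 2] = 100
balanced = preprocess(cast)
```

Real deployments typically do this with calibrated camera settings and library routines rather than hand-rolled math, but the goal is the same: remove lighting variation before the model sees the image.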
In the object localization stage, the system first separates products (foreground) from the tray/counter (background), then identifies each item. When items touch or overlap, instance segmentation predicts pixel-level masks for each object to enable accurate counting.
Feature extraction generates a high-dimensional vector encoding each object's visual attributes. This process must balance capturing subtle differences while ignoring distractions like lighting variations.
Classification matches extracted vectors against a known product database through similarity search. Downstream logic then computes the total and updates the transaction.
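The classification stage can be sketched as a nearest-neighbor lookup over product embeddings. Everything here is illustrative: the 3-D vectors stand in for the high-dimensional features a real extractor produces, and the SKU names and threshold are hypothetical.

```python
import numpy as np

def classify(feature, database, threshold=0.8):
    """Match one extracted feature vector against a product database
    by cosine similarity. Returns the best-matching SKU, or None if
    no entry clears the threshold (a candidate for staff review)."""
    feature = feature / np.linalg.norm(feature)
    best_sku, best_sim = None, threshold
    for sku, emb in database.items():
        sim = float(feature @ (emb / np.linalg.norm(emb)))
        if sim > best_sim:
            best_sku, best_sim = sku, sim
    return best_sku

# Toy database with hypothetical 3-D embeddings.
db = {
    "SKU-CROISSANT": np.array([0.9, 0.1, 0.0]),
    "SKU-BAGEL":     np.array([0.1, 0.9, 0.2]),
}
print(classify(np.array([0.88, 0.15, 0.05]), db))  # → SKU-CROISSANT
```

Production systems replace the linear scan with an approximate nearest-neighbor index so the lookup stays fast across tens of thousands of SKUs.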
In production, teams often run in one of two modes:
Authoritative prediction: the strategy used by “Just Walk Out” stores. The system decides item identity and price without human verification. This demands very high accuracy and broad training coverage.
Assisted prediction: The system proposes a result and staff confirm it. Accuracy and coverage requirements are looser, and it is a practical way to handle long-tail items. For example, in a bakery with BakeryScan, bakery staff can override or correct the system's suggestions if needed.
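The two modes often coexist in one deployment, switched by a confidence threshold. A minimal sketch, with an illustrative cutoff (real systems tune it per category from validation data):

```python
def route_prediction(sku, confidence, auto_threshold=0.97):
    """Decide whether a prediction can be charged automatically
    (authoritative mode) or should be confirmed by staff (assisted
    mode). The 0.97 cutoff is hypothetical, not an industry value."""
    if sku is not None and confidence >= auto_threshold:
        return ("auto_charge", sku)
    return ("staff_verify", sku)

print(route_prediction("SKU-CROISSANT", 0.99))  # → ('auto_charge', 'SKU-CROISSANT')
print(route_prediction("SKU-CROISSANT", 0.80))  # → ('staff_verify', 'SKU-CROISSANT')
```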
What object detection models work well for retail product recognition?
Retail baskets and shelves are crowded. Many products look alike. Several mainstream architectures suit different situations.
The YOLO family is a common baseline for real-time deployment. One forward pass predicts boxes and classes. Inference can be under 20ms in optimized setups.
Among the family, YOLOv5 can be as small as 27MB (depending on variant and configuration) and fits edge hardware. Retail-tuned versions may add attention modules to focus on discriminative regions.
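Single-pass detectors like YOLO emit many overlapping candidate boxes per object, so a post-processing step such as non-maximum suppression (NMS) keeps only the best one. A self-contained sketch of greedy NMS over (x1, y1, x2, y2) boxes:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate detections of one pastry, one distinct detection.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # → [0, 2]
```

Optimized deployments run this on the GPU inside the inference framework, but the logic is the same.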

Two-stage detectors like Faster R-CNN and Mask R-CNN prioritize accuracy. They first generate region proposals, then classify each one. This naturally reduces false positives and improves localization precision. They often perform well on small objects and large scale variation.
Mask R-CNN adds pixel-level masks. This is valuable when products touch. A cluster of pastries that looks stuck together can still be separated for correct counting.
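To see why per-instance masks matter for counting, consider what happens with only a foreground/background mask: touching items merge into one blob, and a connected-component count undercounts them. The sketch below (hypothetical toy masks, pure Python) demonstrates the failure; an instance-segmentation model like Mask R-CNN instead predicts one mask per object, making the count trivial.

```python
def count_instances(mask):
    """Count 4-connected components in a binary mask (list of lists
    of 0/1). Each component approximates one object instance."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                stack = [(r, c)]  # flood-fill this component
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols and mask[y][x] and not seen[y][x]:
                        seen[y][x] = True
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count

# Two pastries touching: the foreground mask sees a single blob.
touching = [[1, 1, 1, 0],
            [0, 1, 1, 1]]
print(count_instances(touching))  # → 1 (undercounts; there are two items)
```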
Deep residual networks like ResNet typically serve as backbone feature extractors embedded in these architectures. Skip connections help optimization in deep nets and support multi-level features from edges and textures to higher-level shapes.
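The skip connection idea reduces to output = x + F(x). A minimal numpy sketch (two linear layers standing in for the convolutional layers of a real ResNet block) shows why optimization stays easy: when the learned branch contributes nothing, the block is simply the identity.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual block: output = x + F(x), where F is two
    linear layers with a ReLU in between. The identity shortcut lets
    gradients flow even when F contributes little."""
    fx = np.maximum(0, x @ w1) @ w2   # F(x)
    return x + fx                     # skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
# With zero weights, F(x) == 0 and the block reduces to the identity.
out = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```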
Vision transformers and vision-language models like CLIP represent the latest frontier. These models support strong transfer and, in some setups, zero-shot recognition from text descriptions. They can reduce cold-start pain when new products arrive with little or no labeled data. Compute cost is still a constraint in many checkout deployments.
In real systems, hybrid strategies are common. YOLO for fast primary detection, with Faster R-CNN used to verify uncertain cases. Or they train specialized models for different categories to balance speed and accuracy.
What type of training data is needed for smart checkout systems?
Model performance is constrained by dataset quality and coverage. Data needs differ by product form factor.
Visual checkout terminals need to capture how items actually look on trays under varying lighting and angles. Studio photos alone are not enough.
Smart carts require in-store capture. Different aisles have different lighting. Items are often occluded by the cart frame or other products.
Just Walk Out stores require extensive overhead images from ceiling cameras. Products may be blocked by customer bodies or captured from extreme angles.

Produce and bakery add another layer. Packaged items can be visually consistent. Apples vary by size, color, and ripeness. Bread varies in crust, shape, and surface texture. Training data must reflect this natural variation.
Dataset scale is the hard part. A small grocery store may carry 10,000 SKUs; a large supermarket can have 30,000–50,000. Ideally, every SKU needs its own training samples.
Amazon reportedly photographed hundreds of thousands of products for Just Walk Out work, building datasets with millions of images. BakeryScan focuses on baked goods, which allows high accuracy with a smaller, narrower dataset.
A dataset tuned to a target environment often beats a universal dataset that tries to cover everything.
The long tail is a persistent pain point. Roughly 80% of retail revenue often comes from 20% of products. The remaining items include local brands and seasonal specials. Each appears rarely, but together they account for meaningful volume.
A common approach is a tiered plan. High-volume products get hundreds to thousands of images. Medium-volume products get tens to hundreds. Long-tail products get only baseline data and run in assisted mode with staff verification.
Transfer learning is also widely used, so features learned from high-volume products help low-volume ones with limited extra data.
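The tiered plan above can be sketched as a simple policy function. The volume cutoffs and image targets here are illustrative placeholders, not industry standards; real programs derive them from their own sales and error data.

```python
def annotation_tier(annual_units):
    """Map a SKU's sales volume to a data-collection tier.
    Cutoffs and image counts are hypothetical examples."""
    if annual_units >= 10_000:
        return {"tier": "high", "target_images": 1000, "mode": "authoritative"}
    if annual_units >= 500:
        return {"tier": "medium", "target_images": 200, "mode": "authoritative"}
    return {"tier": "long_tail", "target_images": 30, "mode": "assisted"}

print(annotation_tier(50_000)["tier"])  # → high
print(annotation_tier(120)["mode"])     # → assisted
```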
What annotation types are required for training checkout vision systems?
Bounding box annotation remains the default. Annotators draw a rectangle around each object. This trains detectors like YOLO and Faster R-CNN and supports counting in crowded scenes.
High-quality boxes should be tight to the object’s outer pixels. Loose boxes shift IoU metrics and can teach the model the wrong spatial cues.
Elongated or rotated objects are a special case. A pen or thin bottle placed diagonally forces an axis-aligned box to include a lot of background. That can confuse training. In these cases, rotated bounding boxes can be a better annotation type.
Image segmentation is more precise. Each object gets a pixel mask instead of a rectangle. It helps with overlap and contact. Annotation effort is typically 3–5× greater than bounding boxes.
In practice, instance segmentation is used for high-volume products or categories that frequently overlap. Low-volume items or those that typically appear alone use only bounding boxes to control costs.
Semantic segmentation assigns each pixel a class (product, background, hand, tray, cart frame, etc.) but does not separate multiple instances of the same class. It is mainly used for scene context. It can improve robustness by helping the system ignore hands, equipment, and structural parts of the checkout setup.
Detected items must be linked to specific SKUs in the retail inventory database.
Packaged goods can be matched automatically through text recognition. But fresh produce or items without clear SKU markings require manual annotation.
This calls for a hierarchical label scheme (parent/child class or attribute-based labels). This work requires annotators familiar with retail inventory systems.
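One way to represent such a scheme is a category → subcategory → SKU tree with a reverse lookup, so a detected class can be resolved to its place in the inventory hierarchy. All names below are hypothetical examples.

```python
# Hypothetical hierarchical label scheme: category -> subcategory -> SKUs.
LABEL_TREE = {
    "produce": {
        "apple": ["SKU-APPLE-FUJI", "SKU-APPLE-GALA"],
    },
    "bakery": {
        "croissant": ["SKU-CROISSANT-PLAIN", "SKU-CROISSANT-ALMOND"],
    },
}

def sku_to_path(sku):
    """Resolve a SKU back to its (category, subcategory) path,
    or None if it is not in the tree."""
    for category, subtree in LABEL_TREE.items():
        for subcategory, skus in subtree.items():
            if sku in skus:
                return (category, subcategory)
    return None

print(sku_to_path("SKU-APPLE-FUJI"))  # → ('produce', 'apple')
```

Parent classes let annotators fall back to a coarser but still useful label ("apple") when the exact SKU is ambiguous from the image alone.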
What public datasets exist for retail product recognition?
Researchers and companies have created several public datasets to support retail product recognition system development. But none fully covers real retail inventory breadth.
RPC: built for automatic checkout. Over 83,000 images across 200 product categories, with both studio-like and cluttered real-world scenes.
SKU-110K: designed for dense shelf detection. 11,762 images, with roughly 150 objects per image on average, and some images reaching 700.
Grocery Store Dataset: 5,125 natural images captured in real supermarkets using smartphones, covering 81 categories (fruits, vegetables, and packaged items like juice, milk, yogurt).
MVTec D2S: pixel-level labels for instance-aware semantic segmentation tasks. 21,000 images across 60 categories.

General-purpose retail datasets are relatively broad, but public datasets for specialized categories remain limited. For models handling baked goods, produce, or specialty merchandise, companies should consider building their own datasets and maintaining them as proprietary assets.
How can training datasets for scanless checkout computer vision systems be built effectively?
BakeryScan recognizes baked goods at 98% accuracy. Amazon's Just Walk Out processes millions of transactions. Mashgin identifies over 100,000 items in seconds. The foundation for all of them is data. The few seconds at the register are backed by millions of labeled images.
To build effective training datasets, we would like to provide several practical recommendations based on experience:
Decide what to cover first. Rather than trying to cover an entire supermarket inventory immediately, focus first on high-volume items. These account for about 80% of checkout volume. Validate the pipeline, then expand coverage.
Define capture specs from the deployment scene. Collect images that match the camera angle, lighting, and background the system will see. Combine in-store photography with controlled studio shooting, so you get both realism and coverage.
Run strict QA and close the loop with production errors. Track failure modes in the field. Add targeted data where the model breaks, and keep updating.
Dense retail annotation is cognitively expensive. Drawing 150 tight boxes on a single shelf image can take an hour. Small mistakes can teach the AI model to ignore valid objects, driving false negatives.
Many AI teams work with specialized data annotation teams for this reason. BasicAI is one example that handles complex retail-style data. They emphasize trained annotators (rather than anonymous crowdsourcing) and structured QA workflows, and support multiple label types such as bounding boxes, image segmentation, keypoints, and rotated boxes.
For teams with in-house labeling, BasicAI also offers on-prem smart data annotation platforms with semi-automatic labeling. A model proposes initial labels, then humans verify and correct them. This can increase throughput while keeping label accuracy under control.
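The propose-then-verify loop can be sketched as a simple routing function. The `model` callable, file names, and threshold below are hypothetical stand-ins for a real pre-labeling network.

```python
def prelabel_batch(images, model, review_threshold=0.9):
    """Model-assisted labeling: accept confident proposals as draft
    labels for quick human verification; queue low-confidence images
    for full manual annotation. `model` returns (label, confidence)."""
    drafts, manual_queue = [], []
    for img in images:
        label, conf = model(img)
        if conf >= review_threshold:
            drafts.append((img, label))   # human just verifies
        else:
            manual_queue.append(img)      # human labels from scratch
    return drafts, manual_queue

# Fake model standing in for the pre-labeling network (hypothetical).
fake_model = lambda img: ("bread", 0.95) if img == "clear.jpg" else ("?", 0.4)
drafts, queue = prelabel_batch(["clear.jpg", "blurry.jpg"], fake_model)
print(drafts)  # → [('clear.jpg', 'bread')]
print(queue)   # → ['blurry.jpg']
```

The throughput gain comes from the asymmetry: verifying a correct draft label is far faster than drawing boxes from scratch.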
Annotation cost depends on label type and precision requirements. Bounding boxes are usually cheaper than semantic or instance segmentation. Large programs may get volume pricing.
When planning smart checkout system development, companies should budget annotation as a significant expense and plan carefully.





