

Autonomous Delivery Robots: Concepts, Evolution, and Technologies Behind


Admon W. · 9 min read

In July 2025, Shenzhen—a major city in southern China—launched a fleet of 41 autonomous delivery robots for the world's first metro-based retail distribution project.

These penguin‑like robots board subway trains during off‑peak hours, plan their routes on their own, get off at stations with 7‑Eleven stores, and deliver straight into the store.

“In the past, delivery workers had to park above ground, unload, then push goods into the subway station. Now with robots, delivery is much easier and more convenient,” said a 7‑Eleven store manager in the pilot.

Using panoramic LiDAR and an intelligent dispatch system, the robots plan optimal paths and deliver efficiently. This hints at how embodied intelligence is moving from factory floors into daily life.


Metro-based Delivery Robots Retail Distribution Project

Robots are taking on dangerous, monotonous, or specialized tasks. Delivery robots are a clear example, and a pioneering smart city application.

This blog post covers the evolution, technology, and applications of delivery robots, with a focus on the perception algorithms and data behind them.

What is Autonomous Delivery Robotics?

An autonomous delivery robot is a mobile robotic system that completes delivery tasks without human control. It carries sensors, navigation systems, and AI models to perceive environments, plan paths, and transport packages, meals, or groceries from origin to destination on sidewalks, campus roads, or indoor spaces.

They typically operate at low speeds, avoid obstacles, yield interactively, dock at precise points, manage lockers or hatches, verify identity and orders, and connect to a cloud dispatch platform.

Delivery robots come in various forms depending on their application scenarios—from four‑ or six‑wheel mini‑vans to compact two-wheeled self-balancing platforms.

In industry terms, these are ground-based service robots, i.e., unmanned ground vehicles (UGVs). As autonomous mobile robots (AMRs), they run the same perception-localization-decision-control loop as self-driving cars while also accommodating how people expect to interact with them.

Unlike industrial robots that execute preset tasks in fixed environments, delivery robots operate in open, dynamic, and unstructured environments. They need stronger perception and more flexible decision‑making.

How Delivery Robots Have Evolved

Interestingly, as perception technology advances, autonomous delivery robots continue to expand in application scenarios and scale.

Roll‑out moved from controlled indoor spaces (hospitals, hotels, office buildings), to sidewalks and campuses, and then to low‑speed open roads. Each step relied on richer visual semantics, steadier geometry, and lower sensor cost.

Starting with Indoor Delivery

In 2014, robotics startup Savioke released Relay, a hotel room service robot. Staff loaded items, and the robot delivered to rooms.

At this stage, delivery robots focused on hospitals and hotels. The mainstream stack used 2D LiDAR with depth cameras and IMU. The priority was robust near‑field obstacle detection and corridor/elevator geometry recognition.

SLAM relied on laser or visual-inertial methods, with QR codes or AprilTags for precise docking. Algorithms were mostly rule‑based with little semantic reasoning, given neat layouts and simple participants.
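To make the docking step concrete, here is a minimal sketch of fiducial-based docking, assuming an AprilTag-family marker and OpenCV's aruco module (4.7+ API); the tag size and camera intrinsics are placeholder values rather than numbers from any specific robot.

```python
# Minimal sketch of fiducial-based docking: detect an AprilTag-family marker
# with OpenCV's aruco module (4.7+ API) and recover the tag pose relative to
# the camera with solvePnP. Tag size and intrinsics below are placeholders.
import cv2
import numpy as np

TAG_SIZE = 0.10                          # marker edge length in meters (hypothetical)
K = np.array([[600.0, 0.0, 320.0],       # camera intrinsics (placeholder)
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
DIST = np.zeros(5)                       # assume an undistorted image

# 3D corners of the tag in its own frame (center at origin, z = 0),
# ordered to match the aruco corner order: TL, TR, BR, BL.
half = TAG_SIZE / 2.0
obj_pts = np.array([[-half,  half, 0.0],
                    [ half,  half, 0.0],
                    [ half, -half, 0.0],
                    [-half, -half, 0.0]], dtype=np.float32)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_APRILTAG_36h11)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

def tag_pose(gray):
    """Return (tag_id, rvec, tvec) for the first detected marker, else None."""
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None
    img_pts = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, DIST)
    return (int(ids[0][0]), rvec, tvec) if ok else None
```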

Scale was limited, but reliability was high. The core loop for navigation and human-robot interaction (HRI) was validated.


Relay Hotel Room Service Robot by Savioke

Rise of Sidewalk and Campus Delivery

In 2018, Starship expanded across universities and neighborhoods in Europe and the U.S. The typical configuration used surround-view RGB cameras as primary sensors, with short-range ultrasonic sensors or low-channel LiDAR for redundancy. Where GNSS was weak, the robots relied on lightweight maps and loop closure.

Perception shifted from handcrafted features to CNN detection and semantic segmentation for pedestrians, cyclists, curbs, and traversable areas. Tracking leaned on Kalman/SORT/DeepSORT.
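For intuition, here is a minimal sketch of the constant-velocity Kalman predict/update step that SORT-style trackers run for each tracked object; the state layout and noise magnitudes are illustrative, not tuned values from any production tracker.

```python
# Illustrative constant-velocity Kalman step used by SORT-style trackers.
# State x = [cx, cy, vx, vy]; the measurement is a detected box center.
import numpy as np

dt = 0.1                                   # frame interval in seconds
F = np.array([[1, 0, dt, 0],               # constant-velocity motion model
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # we only observe the position
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                       # process noise (illustrative)
R = np.eye(2) * 1.0                        # measurement noise (illustrative)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

# One track, one frame: predict, then correct with a matched detection.
x, P = np.array([10.0, 5.0, 0.0, 0.0]), np.eye(4)
x, P = predict(x, P)
x, P = update(x, P, np.array([10.3, 5.1]))
```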

Scenes stayed tightly geo‑fenced, but perception systems began handling dynamic interactions on crowded sidewalks.


Starship Delivery Robot

From 2020–2022, COVID‑19 drove contactless delivery. Kiwibot, Coco, and others expanded deployment in the U.S.

During this period, lightweight deep models matured. MobileNet and EfficientNet made richer perception feasible on edge devices. Synthetic data and simulation cut data collection costs.

Sensor Fusion and Low-Speed Open Road Trials

With falling solid‑state LiDAR prices, camera‑first delivery robots kept one or two low‑channel LiDARs or short‑range radars to handle rain, snow, and backlight.

BEV perception and learned occupancy grids reached production. Curbs, crosswalks, and obstacles moved from pixel labels to occupancy and traversability estimates, improving robustness without lane markings and through construction zones.
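As a rough illustration of the idea, the sketch below rasterizes per-pixel traversability labels into a robot-centric BEV grid by projecting ground-plane cell centers into the image; the flat-ground assumption and the camera parameters are simplifications for clarity, not a production pipeline.

```python
# Sketch: turn per-pixel traversability into a robot-centric BEV grid by
# projecting each ground-plane cell center into the image (flat-ground
# assumption; intrinsics and camera height below are placeholders).
import numpy as np

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
CAM_HEIGHT = 0.5        # camera height above ground in meters (hypothetical)
GRID = 0.1              # cell size in meters
X_RANGE, Y_RANGE = (0.5, 8.0), (-3.0, 3.0)   # forward / lateral extent in meters

def bev_traversability(seg_mask):
    """seg_mask: HxW array with 1 for traversable pixels, 0 otherwise."""
    h, w = seg_mask.shape
    xs = np.arange(*X_RANGE, GRID)          # forward cell centers
    ys = np.arange(*Y_RANGE, GRID)          # lateral cell centers
    grid = np.zeros((len(xs), len(ys)), dtype=np.uint8)
    for i, x in enumerate(xs):              # robot frame: x forward, y left
        for j, y in enumerate(ys):
            # Simplified camera frame: z forward, x right, y down.
            pt_cam = np.array([-y, CAM_HEIGHT, x])
            uz, vz, z = K @ pt_cam
            u, v = int(uz / z), int(vz / z)
            if 0 <= u < w and 0 <= v < h:
                grid[i, j] = seg_mask[v, u]
    return grid
```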

Trajectory prediction upgraded from heuristic yielding to multimodal networks, reducing freezing behavior. At that time, Nuro ran low‑speed unmanned freight tests in the U.S., though V2X and vehicle-infrastructure cooperation remained in small-scale trials.

Deployments reached city‑scale, multi‑site footprints, but remained constrained to fixed corridors and time windows.


COCO Delivery Robot

Where We Are, and What’s Next

Around 2024, Starship completed millions of orders across communities and campuses. Elsewhere, such as in China, campus and community delivery spread to more cities, with open‑road delivery expanding on short, fixed routes.

On the tech side, model compression and hardware acceleration have made lightweight networks and BEV stacks feasible on edge devices. Camera‑only perception covers most daytime cases.

In adverse weather, traditional mmWave radar and thermal cameras are practical add‑ons. 4D imaging radar is still in cost exploration. Self‑supervised depth, temporal stereo, and sparse‑LiDAR alignment reduce heavy‑map reliance. Multi‑task distillation fuses object detection, occupancy, and prediction into one model.

In the next few years, the theme is stronger semantics with fewer sensors. End‑to‑end BEV and world‑model methods will couple detection, geometry, and short‑term planning, improving long‑tail generalization and behavior consistency.

Indoor and campus will keep steady growth. Sidewalk delivery will gain network effects in regulation‑friendly cities. Low‑speed open‑road delivery will keep expanding.

In short: stronger visual semantics, fewer sensors, lighter maps, and deployments that can be replicated city‑wide.

Applications of Delivery Robots

The development path above reveals typical delivery robot applications: indoor delivery, community delivery, and sidewalk delivery. This section examines how delivery robots function in different scenarios.

Hospital and Hotel Delivery

Hospital logistics is one of the most mature cases. Delivery robots move medicines, specimens, blood, and supplies across pharmacy, labs, and operating rooms.

In Israel, Sheba Medical Center uses robots to deliver chemotherapy medicine from pharmacy to nurses, reducing wait time.


The robot developed by the Israeli startup Seamless Vision (Sheba Medical Center)

Hotels and office towers have similar “room/office drop” tasks: food, luggage, and small items from the front desk to doors.

Both applications occur in lobbies, corridors, elevators, and passages—indoor environments with clear geometric structures. Core skills are robust human and obstacle detection, narrow-space yielding, and coordination with access control and elevators.

Perception primarily uses 2D/3D LiDAR and depth cameras. Vision cameras aid doorplate recognition and traversable‑area segmentation. Short‑range radar or low‑channel LiDAR improves robustness under backlight and glass reflections.

Campus Sidewalk Food Delivery

This is a prominent case: food, coffee, and snacks from dining halls or merchants to dorms and academic buildings. The environment includes shared paths, crosswalks, and sloped curbs used by pedestrians and non‑motorized vehicles.

Perception stacks usually use surround RGB cameras as primary, with segmentation and detection for pedestrians, cyclists, pets, and curbs. Occupancy and BEV support traversability. With unreliable GNSS, they rely on light semantic maps and loop closure.

George Mason University was the first campus to make on‑demand robotic delivery part of a meal program, with a 25‑robot fleet from Starship.


Food Delivery Robots in George Mason University

Limits include special delivery requests (e.g., leaving at a door) and rough terrain. Remote operators may assist to bypass obstacles.

Smart Retail Grocery Delivery

Grocery delivery involves high order frequency, temperature control, and tight timing. The task is transporting fresh produce, frozen goods, and daily necessities from supermarkets or urban micro-fulfillment centers to residences or community pickup points.

Perception in this case must be stable across repeated routes and in all weather. Camera‑led traversability estimation identifies curbs, puddles, leaves, and uncleared snow. At night and in rain or snow, short‑range radar or 4D radar fills the gaps. Routes are often fixed store‑to‑community corridors. Maps are light and updated online to handle temporary barriers and parked cars.

In early 2022, Nourish + Bloom became the world’s first Black‑owned autonomous grocery store. It uses computer vision with AI voice and gesture for checkout. Delivery runs via Daxbot robots, up to 4 mph for 10 miles, with temperature‑controlled cargo.

Package Delivery

In January 2019, Amazon launched an experimental service with Amazon Scout to deliver small parcels to Prime customers.


Amazon Scout Delivery Robots

In package delivery, robots travel from final distribution centers to pickup points or residential doorsteps. Perception emphasizes recognition and docking: robust detection and alignment for door access systems, locker doors, doorplates, and landmarks; traversability on steps, ramps, and tactile paving; and safe yielding for long‑tail interactions such as strollers and dog walking.

A Potential Application: Nighttime Inventory Replenishment

Pandemic‑era operations showed the value of nighttime short‑haul replenishment. From 10 pm to 6 am, robots shuttle standard totes among backrooms, community warehouses, and metro pickup points, pre‑positioning inventory for the morning peak in instant retail and parcel delivery.

The scene features low traffic, low light, occasionally wet surfaces, cleaning vehicles, and temporary closures at some intersections. Perception must stay stable under low light and glare, and distinguish standing water, black ice, leaf piles, and cones. Low‑light and bad‑weather robustness is the key to scaling this use case.

Perception Technology and Data Behind Delivery Robotics

Perception Technology in Delivery Robots

In the mainstream, delivery robots typically take a camera‑first approach, with other modalities as backup.

For sidewalks and campuses, surround RGB is primary for detection, segmentation, and traversability. In rain, snow, and at night, short‑range mmWave or 4D imaging radar adds redundancy. When cost is tight, teams keep only one or two low‑channel solid‑state LiDARs to guard against geometric degradation.

Low‑speed open‑road robots add more LiDAR/radar redundancy and RTK‑GNSS. Indoor robots primarily use 2D LiDAR + depth cameras + VIO.


Yandex Delivery Robot

Key algorithms include deep learning-based object detection (pedestrians, vehicles, obstacles), semantic segmentation (traversable areas, sidewalk, grass), BEV occupancy prediction, lightweight SLAM (visual or laser), and multi‑object tracking. Compute limits force heavy model compression and lightweight backbones such as MobileNet and YOLO‑Nano.
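To show why lightweight backbones matter on these compute budgets, the short sketch below compares parameter counts of a standard 3x3 convolution against a MobileNet-style depthwise-separable block; it illustrates the general idea rather than any specific production model.

```python
# Sketch: why MobileNet-style depthwise-separable convolutions fit edge
# compute budgets -- compare parameter counts against a standard 3x3 conv.
import torch.nn as nn

def params(m):
    return sum(p.numel() for p in m.parameters())

c_in, c_out = 128, 128

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

depthwise_separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise
    nn.Conv2d(c_in, c_out, kernel_size=1),                          # pointwise
)

print(f"standard 3x3 conv:   {params(standard):,} params")
print(f"depthwise separable: {params(depthwise_separable):,} params")
# Roughly an 8x parameter reduction at these channel widths.
```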

How Delivery Robots Differ From Autonomous Cars

Their methods are closely related. Both need perception, localization, and planning. Both employ multi-sensor fusion and target real‑time safety under long‑tail events, with similar module graphs (detection, tracking, prediction, planning).

However, delivery robots are slower (5–10 km/h vs. 30–120 km/h), with looser reaction‑time requirements. Interactions are near‑field on unstructured sidewalks and campuses, requiring finer traversability estimation over curbs, narrow passages, and temporary static obstacles, with more focus on pedestrians than vehicles. Lower camera height worsens occlusion. Cost and power budgets limit the use of high‑channel LiDAR and push aggressive model compression.


Starship Delivery Robots

Passenger AVs need longer range and high‑speed robustness, and favor heavy sensor redundancy (high‑channel LiDAR, long‑range radar), more compute, and more formal map and traffic semantics. Cost and power profiles are very different: AVs pursue very low disengagement rates through higher redundancy. On data, AVs benefit from mature lane/signal/regulatory ontologies; delivery robots lack a unified sidewalk ontology, so cross‑city generalization leans on online occupancy and weakly supervised semantic alignment.

Algorithm Training and Data for Delivery Robots

Training these algorithms requires extensive data collection and annotation.

Primary data comes from on‑robot, long‑term collection across cities, seasons, day/night, rain/snow/fog, holidays with dense crowds, and long‑tail interactions (dog leashes, strollers cutting in, randomly placed cones, snow piles, puddles, etc.).

To fill gaps, teams often use simulation, synthetic weather, domain randomization, and procedural crowds, or use high‑cost multi‑sensor “teachers” to produce pseudo‑labels (occupancy, depth, semantics) for camera‑only student models.
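As a hedged illustration of the pseudo-label route, the sketch below trains a toy camera-only student to regress dense depth produced offline by a multi-sensor teacher; the network, loss, and tensor shapes are placeholders, not any team's actual model.

```python
# Sketch of pseudo-label distillation: a camera-only student regresses dense
# depth produced offline by a multi-sensor "teacher" (e.g. LiDAR-completed
# depth maps). Model and loss choices here are illustrative placeholders.
import torch
import torch.nn as nn

class TinyDepthStudent(nn.Module):
    """Toy encoder-decoder standing in for the real camera-only network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, img):
        return self.net(img)

student = TinyDepthStudent()
optim = torch.optim.Adam(student.parameters(), lr=1e-4)
l1 = nn.L1Loss(reduction="none")

def train_step(img, teacher_depth, valid_mask):
    """img: Bx3xHxW, teacher_depth: Bx1xHxW pseudo-labels, valid_mask: Bx1xHxW."""
    pred = student(img)
    # Supervise only where the teacher produced a confident pseudo-label.
    loss = (l1(pred, teacher_depth) * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```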

Joint multi‑task training often optimizes detection, segmentation, BEV occupancy, vectorized curbs, trajectory prediction, and traversability scores. Long tails are eased by resampling/reweighting and hard‑example mining.
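One common way to implement the resampling step is inverse-frequency sampling over scenario tags, sketched below with PyTorch's WeightedRandomSampler; the tag names and counts are hypothetical.

```python
# Sketch of long-tail resampling with inverse-frequency weights, assuming each
# clip carries a scenario tag (labels below are hypothetical). Rare scenarios
# are drawn more often without duplicating the underlying data.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Scenario tag per training clip (placeholder distribution).
clip_tags = ["clear_day"] * 900 + ["night_rain"] * 80 + ["snow_pile"] * 20

counts = Counter(clip_tags)
weights = [1.0 / counts[tag] for tag in clip_tags]   # inverse frequency

sampler = WeightedRandomSampler(weights, num_samples=len(clip_tags),
                                replacement=True)
# Pass `sampler=sampler` to the DataLoader instead of shuffle=True.
```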

Similar to self-driving cars, delivery robots involve rich data modalities: images, point clouds, radar data, or multi-modal fusion data. For data annotation, we've identified several key challenges:

  • Ambiguous sidewalk semantic boundaries (legally walkable vs. physically traversable).

  • Centimeter‑level curb and ramp geometry.

  • Instance continuity and cross‑frame IDs under heavy occlusion.

  • Thin objects such as pet leashes.

  • Glass/reflective surfaces and night glare.

  • Low‑visibility “assume‑present” agents (like children).

  • Docking states with access control, lockers, and elevators.

High-quality ground truth for occupancy and depth is expensive to produce, so teams often build weak truth via multi‑view fusion and offline SLAM. Privacy compliance requires face and license‑plate blurring, and the ontology must stay consistent across cities.

Finding the Right Data Annotation Service Provider for Delivery Robots

Label quality sets the ceiling for perception. Robots work close to people in dynamic, unstructured spaces. Training sets must span cities, weather, day/night, and long‑tail interactions, with fine, precise semantics.

If the ontology is vague, spatial and temporal alignment is off, or QC is weak, BEV occupancy, tracking, and prediction will be biased, and the errors will surface in planning and behavior. A clean ontology, reliable weak truth (such as occupancy/depth pseudo-labels from offline multi-sensor fusion), and stable QA processes significantly reduce the cost and risk of data-loop iteration.

When selecting annotation tools and services, first validate native support for multi-sensor and temporal data: camera-LiDAR synchronization and extrinsic/intrinsic parameter management; 2D/3D bounding boxes, segmentation, and cross-frame ID tracking within the same workflow; and efficient interpolation and conflict resolution for long sequences.
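To make the extrinsic/intrinsic requirement concrete, here is a minimal sketch of projecting LiDAR points into a camera image, the alignment that 2D/3D fusion labeling depends on; the calibration matrices are placeholders.

```python
# Sketch: project LiDAR points into a camera image using extrinsics and
# intrinsics -- the alignment behind 2D/3D fusion labeling. The calibration
# values below are placeholders, not real sensor parameters.
import numpy as np

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)                  # LiDAR -> camera extrinsics (placeholder)
T_cam_lidar[:3, 3] = [0.0, -0.2, 0.1]

def project_points(points_lidar):
    """points_lidar: Nx3 array in the LiDAR frame -> Mx2 pixel coordinates."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])        # homogeneous
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1                            # keep points ahead of camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]                           # perspective divide
```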

A strong data annotation platform offers configurable multi‑level ontologies, human‑in‑the‑loop auto‑labeling, and privacy and security controls.


BasicAI Data Annotation Platform & Services for Delivery Robots

Data labeling service capability matters too: steady delivery cadence, peak throughput, sidewalk/campus domain expertise, clear quality metrics, and explainable error analysis.

Focus on total cost of ownership, not only unit price. Tool efficiency, automation ratio, and rework rate decide true per‑sample cost and time to launch.

As an experienced provider, BasicAI aligns well with these needs. The proprietary platform is built for multi‑sensor fusion and supports efficient, aligned 2D&3D fusion labeling, with industry‑leading point‑cloud tooling and auto‑label to boost throughput and consistency. Services are delivered by an in‑house team with QC processes targeting 99%+ accuracy. Combined with efficient project management and tools, the total price is highly competitive.

BasicAI has served many Fortune 500 companies and leading AI teams, with large‑scale deliveries across industries and scenarios (esp. autonomous driving). Mature processes and engineering capacity are crucial for the continuous data flywheel and rapid iteration in delivery robotics.





