

3D Cuboid Annotation for Computer Vision Model Training

What is a 3D cuboid for LiDAR and sensor fusion data? Learn the formats, applications, and tips for accurate cuboid annotation in point clouds.


Admon W.

Early perception systems relied on images and 2D bounding boxes. A 2D box is a clean way to outline an object within a pixel grid. But the world is three-dimensional. If you care about size, depth, orientation, and spatial relations, you need labels that describe geometry in real space.

Modern 3D sensors introduced new data modalities. LiDAR measures distance by emitting laser pulses and timing their return, producing 3D point clouds. This data demands new annotation methods.

3D cuboids (3D bounding boxes) emerged as a practical solution. They carry the key signals autonomous systems need for safety-critical decisions like collision avoidance, path planning, and control, while staying simple enough to label at scale.

Major benchmarks such as KITTI, nuScenes, and Waymo Open Dataset standardized on 3D cuboid annotations. Their conventions shaped much of today’s industry norms.

In this post, we cover the concept behind 3D cuboids, their main applications, how they are represented, and a practical annotation workflow.


3D cuboid (3D bounding box) annotation

What is 3D cuboid annotation in computer vision?

A 3D cuboid (3D bounding box) is a rectangular box in 3D space that tightly encloses a target object.

Unlike a 2D bounding box, a cuboid captures:

  • Position in 3D Euclidean space,

  • Physical dimensions along three axes, and

  • Orientation relative to a defined coordinate frame.

This volumetric representation changes how ML models perceive and reason about the physical world.

3D cuboid annotation is especially important for LiDAR point cloud data. LiDAR is a primary sensing modality in today’s autonomous driving systems. It generates sparse but highly accurate 3D point clouds: millions of 3D coordinates representing laser pulse returns from surfaces in the environment. In a point cloud, a 3D cuboid defines which points belong to an object and which do not.
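As a concrete illustration, point membership can be computed by moving points into the cuboid’s local frame and comparing against its half-extents. Here is a minimal NumPy sketch, assuming a yaw-only box (flat ground); the function name and parameter layout are illustrative, not any particular tool’s API:

```python
import numpy as np

def points_in_cuboid(points, center, dims, yaw):
    """Return a boolean mask of points inside a yaw-only 3D cuboid.

    points: (N, 3) array of x, y, z coordinates
    center: (3,) cuboid center
    dims:   (3,) length, width, height
    yaw:    rotation around the vertical z-axis, in radians
    """
    # Move points into the cuboid's local frame: translate, then undo yaw.
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - center) @ rot.T
    # A point is inside if every local coordinate is within half the extent.
    return np.all(np.abs(local) <= np.asarray(dims) / 2.0, axis=1)
```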

Autonomous vehicles and many robots combine multiple sensing modalities (LiDAR, cameras, radar, and others) through sensor fusion. Each provides different signals, and each has its own coordinate frame. A 3D cuboid provides a common 3D reference that 2D camera observations can be projected onto and aligned with.

What CV tasks and applications benefit most from 3D cuboid annotations?

Core computer vision tasks

3D object detection is the most fundamental task built on 3D cuboid data. Instead of predicting a 2D pixel region, the model predicts an object’s real-world 3D position, metric size, and rotation.

Trajectory prediction and 3D object tracking. Tracking requires consistent object identity across frames and an estimate of motion over time. Cuboids provide ground-truth geometry in real space. Datasets such as Argoverse are common benchmarks here. Trained systems can predict trajectories, match detections across frames using spatial proximity and motion consistency, and handle partial occlusion or new objects entering the scene.

Autonomous driving and ADAS

In autonomous driving, 3D object detection is the foundation for nearly all downstream 3D perception tasks, including motion prediction, trajectory planning, and collision avoidance.

Cuboids provide the physical footprint of nearby vehicles and other agents and their exact distance to the ego vehicle. That supports safe following distance and correct time-to-collision (TTC) calculations for emergency braking. By analyzing the heading angle (yaw) of vehicle or pedestrian cuboids, the system can also infer motion intent.
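For intuition, here is the simplest constant-velocity TTC calculation as a sketch. This is an illustrative kinematic model, not any production ADAS logic; the function and parameter names are hypothetical:

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity time-to-collision estimate.

    gap_m: longitudinal distance from the ego vehicle to the lead
           vehicle's cuboid rear face, in meters.
    Returns TTC in seconds, or infinity if the gap is opening.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return float("inf")  # not closing; no collision on current course
    return gap_m / closing_speed

# e.g. a 30 m gap, ego at 20 m/s, lead at 15 m/s -> TTC of 6.0 s
```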


Point cloud data: urban driving scene

Robot navigation and obstacle avoidance

Warehouse picking robots, AGVs, and humanoid platforms all need 3D perception for planning and interaction.

In warehouses, factories, and service spaces, it’s not enough to know where an object appears in an image. The robot needs its 3D position and orientation to plan a collision-free path and execute actions.

For manipulation, a robot arm needs accurate 3D localization and pose to grasp objects from shelves. Mobile robots need accurate 3D sizes and locations of obstacles, furniture, and people for safe, natural navigation.

Augmented reality (AR)

AR and VR depend on consistent alignment between virtual content and the physical world.

3D cuboid annotations capture the precise physical dimensions and spatial positions of real-world items in a user’s environment. That supports correct occlusion (virtual content hidden behind real furniture), stable placement on surfaces, and consistent depth behavior as the user moves.

How are 3D bounding boxes represented?

To make a 3D cuboid machine-readable, teams follow a standardized mathematical format. The industry-standard 3D bounding box is defined by 9 parameters that fully specify position, size, and rotation.

The 9 parameters

  1. Center coordinates: the x, y, z coordinates of the cuboid’s center in a relevant coordinate frame. In autonomous driving, this is often an ego-vehicle frame with origin on the ground under the vehicle center, with x pointing forward, y pointing left, and z pointing up.

  2. Dimensions: the cuboid’s physical size, given as length, width, and height (l, w, h). These values describe the box’s extent along its three principal axes, in the same units as the coordinate frame (usually meters in autonomous driving).

  3. Orientation / rotation: the cuboid’s rotation around the x, y, and z axes, commonly called roll, pitch, and yaw. Together, these three angles fully define any 3D rotation.

    • Yaw (rotation around the vertical z-axis) is critical in autonomous driving. It directly determines a vehicle’s heading, whether a pedestrian faces toward or away from the ego vehicle, and which side of an object faces which direction.

    • Pitch (rotation around the lateral y-axis) describes forward or backward tilt, relevant for vehicles on ramps or slopes.

    • Roll (rotation around the longitudinal x-axis) describes lateral tilt, relevant for leaning motorcycles.

In many road-driving datasets (including the standard KITTI dataset), objects on relatively flat ground are often assumed to have near-zero pitch and roll. That reduces the parameter set from 9 to 7 for efficiency.
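As a sketch of what those parameters look like in code, here is a minimal 7-parameter box under the flat-ground assumption. The class and field names are illustrative, not any dataset’s official schema:

```python
from dataclasses import dataclass

@dataclass
class Cuboid7:
    """7-parameter 3D bounding box (flat-ground assumption: roll = pitch = 0)."""
    x: float       # center, forward (m)
    y: float       # center, left (m)
    z: float       # center, up (m)
    length: float  # extent along the heading direction (m)
    width: float   # extent to the side (m)
    height: float  # vertical extent (m)
    yaw: float     # heading angle around the z-axis (rad)

# The full 9-parameter form simply adds roll and pitch fields.
```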

Different coordinate frames

Cuboid parameters depend on the coordinate system:

  • The sensor frame is defined relative to a single sensor. A LiDAR sensor defines its own 3D frame with the origin at the sensor center and axes aligned with its mounting orientation.

  • The camera frame places its origin at the camera’s optical center, with the z-axis pointing along the optical axis.

  • The ego-vehicle frame places its origin at the vehicle center, with the x-axis aligned to the vehicle heading.

  • The global/world frame references fixed earth coordinates, typically using a latitude-longitude-altitude system or a local Cartesian projection.


Projecting LiDAR points onto a 2D image

In sensor fusion pipelines, LiDAR points live in the LiDAR frame, while images live in the camera frame. To align them, you project between frames using calibration:

  • Apply extrinsics to transform world (or LiDAR) coordinates into the camera frame;

  • Apply intrinsics to project camera coordinates onto the 2D image plane (see the sketch below).
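A minimal NumPy sketch of that two-step projection, assuming a pinhole camera model with a 4x4 extrinsic matrix T_cam_lidar and a 3x3 intrinsic matrix K obtained from calibration (both names are illustrative):

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project LiDAR points onto the image plane of a calibrated camera.

    points_lidar: (N, 3) points in the LiDAR frame
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame
    K:            (3, 3) camera intrinsic matrix
    Returns (M, 2) pixel coordinates for points in front of the camera.
    """
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])  # homogeneous coords
    cam = (T_cam_lidar @ homo.T).T[:, :3]              # apply extrinsics
    cam = cam[cam[:, 2] > 0]                           # keep points in front
    pix = (K @ cam.T).T                                # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]                    # perspective divide
```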

How to perform 3D cuboid annotation on BasicAI Data Annotation Platform

With the necessary concepts covered, let’s turn to practice. Production-scale 3D cuboid labeling needs tools that can render point clouds with millions of points, together with synchronized high-resolution images, at low latency.

Below is a typical workflow using the BasicAI Data Annotation Platform as an example for LiDAR-camera fusion. We assume that raw data collection, multi-sensor calibration, data preprocessing, and dataset organization have already been completed.

Create the dataset and ontology

Before drawing any cuboids, the project manager must define the annotation schema and dataset parameters.

Dataset creation. From the BasicAI homepage, go to Datasets and create a new dataset. Choose LiDAR Fusion as the dataset type to indicate synchronized LiDAR point clouds and camera images. Use a clear dataset name (for example, “Urban Driving Scene LiDAR Fusion Dataset”), then confirm.


Create a dataset on BasicAI Data Annotation Platform

Ontology definition. Open the dataset’s Ontology tab and define classes, attributes, and tools. The ontology specifies which object types exist, which attributes each type has, and which annotation tool is used per type.

For this autonomous driving scene, create a main class such as “car” (passenger cars, SUVs, and similar). Set its tool type to Cuboid. Common attributes for vehicles include:

  • Occlusion: fully visible / partially occluded / heavily occluded.

  • Truncation: partly outside sensor range or camera view.

  • Motion state: parked / moving / stopped.


Create ontologies on BasicAI Data Annotation Platform

Manual 3D cuboid annotation

Open the annotation UI. In the Data tab, select a LiDAR fusion frame or a sequence, then click Annotate. The UI is typically split into a left tool panel, a central 3D+2D view, and a right ontology/attribute panel.


3D cuboid annotation on BasicAI Platform

Select the cuboid tool. Choose the 3D cuboid tool (hotkey “1” or “F”).

Two-click cuboid creation. BasicAI uses an assisted cuboid creation flow. Click twice in the point cloud to set an initial extent: first click one outer corner of the target cluster, second click the diagonal corner. The system proposes an initial 3D cuboid. The direction from the first click to the second influences the initial yaw.


2-click cuboid creation on BasicAI platform

Multi-sensor projection and linking. Because the dataset is LiDAR fusion, the tool uses preloaded extrinsics and intrinsics to project the 3D cuboid onto the aligned camera image in real time. It also creates a 2D bounding box and a 2D cuboid in the image view. These instances share the same trackID and trackName, so downstream models can treat them as one physical object across modalities.

Refine the cuboid. Adjust position, size, and rotation in the 3D view: resize by dragging faces, rotate with the rotation controls, and translate the whole box. The goal is a box that includes every point belonging to the object while excluding nearby objects, infrastructure, and ground returns.

Set attributes. On the floating panel, pick the “car” class and fill in occlusion and truncation based on visual evidence.

Optional: Auto-annotation (pre-labeling). For efficiency, annotators can run a built-in pretrained model. On the BasicAI platform, this is triggered via the “brain” button for one-click inference. The model proposes cuboids for vehicles and other classes across the scene, shifting the human role from drawing to verification and correction.

Save, exit, and export. After labeling all target objects in the frame (or across a 4D-BEV sequence), click Save, then close the UI. A manager can select completed items in the Data tab and run Export. Outputs are commonly produced in formats such as JSON or KITTI-style files.
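For reference, the public KITTI object label format stores one object per line of a plain-text file. Here is a minimal parser sketch; the field layout follows the published KITTI convention, and the example values in the comment are illustrative:

```python
def parse_kitti_label(line):
    """Parse one line of a KITTI-style object label file.

    Field order (KITTI object format): class, truncation, occlusion,
    observation angle alpha, 2D bbox (left, top, right, bottom),
    3D dimensions (h, w, l), 3D location (x, y, z) in the camera frame,
    and rotation_y (yaw around the camera's y-axis).
    """
    f = line.split()
    return {
        "class": f[0],
        "truncation": float(f[1]),
        "occlusion": int(f[2]),
        "alpha": float(f[3]),
        "bbox_2d": [float(v) for v in f[4:8]],
        "dimensions_hwl": [float(v) for v in f[8:11]],
        "location_xyz": [float(v) for v in f[11:14]],
        "rotation_y": float(f[14]),
    }

# e.g. parse_kitti_label("Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
#                        "1.65 1.67 3.64 -0.65 1.71 46.70 -1.59")
```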

Practical tips for 3D cuboid annotation

3D cuboid labeling is labor-heavy and easy to get wrong. Based on real project experience, here are some strategies to maintain high efficiency and quality.

  • Ensure tight fit. Avoid large empty margins, and do not cut off valid points. Rotate the 3D view often and inspect from multiple angles before finalizing.

  • Align the bottom face to the ground. Most road and warehouse objects rest on the ground plane. Many tools, including BasicAI, can run ground segmentation to separate ground and non-ground points. Another simple tactic is to restrict the visible height range, for example showing only points 0.5–5 m above ground to reduce clutter (see the sketch after this list).

  • Calibrate sensor alignment when using LiDAR fusion data. Visual mismatches between 3D point clouds and 2D camera images are common. BasicAI provides online calibration tools (the camera + gear button on the dataset page). Place a reference point on a clearly identifiable physical feature in the point cloud, then adjust its 2D projection in the corresponding camera image until the coordinate frames align.

  • Human-model coupling. Advanced platforms like BasicAI include pre-trained ML models that detect common classes and propose cuboids. For sequences, auto tracking features can propagate cuboids across frames and keep stable object IDs.

  • Consider partnering with professional annotation providers. 3D cuboid annotation demands significant cognitive load, substantial hardware resources, and specialized expertise. Leading AI teams frequently work with dedicated annotation companies that bring robust multi-tier QA frameworks and purpose-built infrastructure to deliver high-quality, consistent 3D annotations at industrial scale. This directly supports safer, more reliable AI models in production.
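The height-range trick mentioned above amounts to a one-line mask. A minimal NumPy sketch, assuming points expressed in a ground-referenced frame with z up; the 0.5–5 m bounds are just the example values from the tip:

```python
import numpy as np

def filter_height_range(points, z_min=0.5, z_max=5.0):
    """Keep only points whose height falls inside [z_min, z_max].

    points: (N, 3) array with z up, in a ground-referenced frame.
    """
    mask = (points[:, 2] >= z_min) & (points[:, 2] <= z_max)
    return points[mask]
```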


