Since the advent of computer vision (CV), scientists have envisioned a future where machines can perceive and comprehend the world as humans do – in three dimensions. While computer vision still largely relies on two-dimensional images, three-dimensional capabilities hold the key to true environmental understanding. SLAM (Simultaneous Localization and Mapping) technology, which constructs 3D maps, drives rapid advancements in intelligent robots, autonomous vehicles, augmented reality, and other artificial intelligence (AI) applications.
Today, let's go through five basics of SLAM: its definition, categories, framework, applications, and trends.
1. What is Simultaneous Localization and Mapping (SLAM)?
Autonomous mobility requires that a robot know both its location and surroundings – answering “Where am I?” and “What is around me?” Traditional methods involve placing visual markers in the environment, using wireless positioning, or equipping robots with GPS.
But what if such external positioning is unavailable or inadequate?
That's where SLAM comes in. It enables a robot to map an unfamiliar environment and pinpoint its own location within it at the same time, using only its onboard sensors. In other words, it estimates the robot's changing position and builds a model of the environment solely from sensor inputs. SLAM therefore depends heavily on sensor technology: most systems are built around either LiDAR sensors or visual cameras.
2. Is LiDAR SLAM always better than Visual SLAM?
LiDAR SLAM uses a 2D or 3D LiDAR as the external sensor that supplies range data for localization and mapping. Visual SLAM relies on cameras, which capture large amounts of rich, redundant visual information and therefore support strong scene recognition. Currently, 3D visual SLAM is becoming more prevalent, with common visual sensors including monocular cameras, stereo camera systems, and RGB-D cameras.
Visual SLAM falls into three main categories according to the sensor used: monocular, stereo, and RGB-D. Monocular systems have one camera and stereo systems have two. RGB-D cameras work on a different principle from ordinary cameras: in addition to capturing color images, they measure the distance from the camera to the surface seen at each pixel. There are also some special or emerging visual SLAM sensor types, like panoramic cameras and event cameras.
Using a single camera for SLAM is called monocular visual SLAM. Each captured image is a 2D projection of the 3D environment, so a single image cannot directly determine depth. To recover the 3D structure of the environment, the camera must move, and its motion must be estimated from the image sequence. The parallax of objects between frames (nearby objects shift more than distant ones) yields relative depth values, but monocular SLAM cannot determine absolute scale from images alone. These limitations led to the development of stereo camera systems and depth cameras.
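The scale ambiguity can be illustrated with a simple pinhole projection: if the whole scene and the camera translation are scaled by the same factor, the pixel coordinates do not change, so the images alone cannot reveal the true scale. The following is a minimal sketch; the focal length, the landmark, the camera motion, and the function names are all illustrative assumptions.

```python
import math

def project(point, cam_x, focal_px=500.0):
    """Pinhole projection of a 3D point for a camera shifted by cam_x along X."""
    x, y, z = point
    return (focal_px * (x - cam_x) / z, focal_px * y / z)

def scale(point, s):
    """Scale a 3D point about the origin."""
    return tuple(s * c for c in point)

point = (1.0, 0.5, 4.0)   # hypothetical landmark, in meters
translation = 0.2         # hypothetical camera motion between two frames
s = 3.0                   # arbitrary scale factor

# Two views of the original scene.
views_original = [project(point, 0.0), project(point, translation)]
# Two views of a scene that is 3x larger, seen with a 3x larger camera motion.
views_scaled = [
    project(scale(point, s), 0.0),
    project(scale(point, s), s * translation),
]

# The pixel coordinates agree, so the two situations are indistinguishable.
same_pixels = all(
    math.isclose(a, b)
    for pa, pb in zip(views_original, views_scaled)
    for a, b in zip(pa, pb)
)
print(same_pixels)
```

In practice, the missing scale is usually supplied by something outside the image stream, such as a known baseline, a depth sensor, or an inertial measurement.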
The goal of these sensors is to measure object distance directly, overcoming the inability of monocular cameras to sense depth. Once distance is known, the 3D structure of a scene can be recovered from a single image, which also eliminates the scale ambiguity.
Humans judge the distance of objects from the differences between the images seen by their two eyes. Stereo (binocular) cameras work on the same principle, using the known distance between the two cameras (the baseline) to estimate the spatial position of each pixel. The measurable depth range is related to the baseline: the larger the baseline, the farther away objects can be measured. The downsides are that configuration and calibration are relatively complex, and that depth range and accuracy are limited by the baseline and the image resolution. Disparity calculation is also computationally expensive, and typically needs GPU or FPGA acceleration before depth for the entire image can be output in real time.
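The relationship between baseline, disparity, and depth can be made concrete with the standard rectified-stereo formula Z = f * B / d. This is a minimal sketch that assumes an ideal, rectified camera pair; the focal length, baselines, and function names are illustrative assumptions rather than values from any particular device.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair.

    A non-positive disparity is treated as "no usable depth".
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px

def depth_step_per_pixel(depth_m, focal_px, baseline_m):
    """Approximate depth change caused by a one-pixel disparity error: Z^2 / (f * B)."""
    return depth_m ** 2 / (focal_px * baseline_m)

focal_px = 700.0  # hypothetical focal length in pixels

# A 42-pixel disparity with a 12 cm baseline.
depth_near = depth_from_disparity(42.0, focal_px, 0.12)

# Depth uncertainty at 10 m for a narrow and a wide baseline.
step_narrow = depth_step_per_pixel(10.0, focal_px, 0.12)
step_wide = depth_step_per_pixel(10.0, focal_px, 0.24)

print(depth_near, step_narrow, step_wide)
```

Doubling the baseline halves the depth error at a given distance, which is one way to read the statement that a larger baseline extends the usable range.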
Depth cameras, also known as RGB-D cameras, measure object distance by actively emitting light toward objects and analyzing the reflected light, similar to LiDAR sensors. This differs from stereo camera systems, which compute depth in software from two camera views. Because depth cameras measure distance physically, they save substantial computing resources. However, current RGB-D cameras have limitations such as narrow measurement range, high noise levels, small field of view, sunlight interference, and the inability to measure transparent materials. These issues currently restrict most RGB-D SLAM applications to indoor environments. Ongoing research aims to mitigate these limitations and expand the capabilities of RGB-D SLAM.
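Once a depth camera supplies a distance for each pixel, recovering 3D points amounts to inverting the pinhole projection. The sketch below assumes a simple pinhole model with intrinsics fx, fy, cx, cy and assumes that a depth of 0 marks a pixel with no valid measurement; both are assumptions for illustration, and real devices document their own conventions.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Turn pixel (u, v) with a measured depth into a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def depth_image_to_points(depth, fx, fy, cx, cy):
    """Convert a depth image (list of rows, meters) into a list of 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # assumption: 0 means "no return" for this pixel
                points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points

# Tiny hypothetical 2x2 depth image with two valid pixels.
depth = [
    [0.0, 1.0],
    [2.0, 0.0],
]
cloud = depth_image_to_points(depth, fx=525.0, fy=525.0, cx=0.5, cy=0.5)
print(cloud)
```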
2D LiDARs, also known as single-line LiDARs, are widely used in warehouse automatic guided vehicles (AGVs), service robots, and cleaning robots. A common design consists of a paired laser emitter and receiver head mounted on a motor fitted with an optical encoder. In operation, the motor rotates at a constant speed while the laser emits probing beams at a fixed frequency. The receiver head then records the angle and timing of the returned echoes to calculate object distances.
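The ranging step described above can be sketched as a time-of-flight calculation followed by a polar-to-Cartesian conversion. This is a minimal illustration that assumes the sensor reports a beam angle and a round-trip echo time per measurement; the function names and sample values are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def range_from_echo(round_trip_s):
    """Distance to the reflecting surface; the pulse travels there and back."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0

def scan_to_points(angles_rad, round_trips_s):
    """Convert (angle, echo time) pairs into 2D points in the sensor frame."""
    points = []
    for angle, t in zip(angles_rad, round_trips_s):
        r = range_from_echo(t)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Hypothetical echoes from surfaces 5 m away, straight ahead and to the left.
echo_5m = 2.0 * 5.0 / SPEED_OF_LIGHT
points = scan_to_points([0.0, math.pi / 2.0], [echo_5m, echo_5m])
print(points)
```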
Unlike visual data, LiDAR data is relatively sparse: each return carries only the position of a detected reflection point, given as a distance at a known angle. The planar data obtained by a 2D LiDAR can be processed using image analysis methods.
3D LiDARs can be divided into solid-state devices, hybrid units, and mechanical scanners. Solid-state LiDARs resemble cameras, but each “pixel” can actively emit laser light. Hybrid units improve mechanical scanners by minimizing moving parts to reduce measurement errors from vibration and wear. Mechanical 3D LiDARs rotate both horizontally and vertically to scan an area.
Because 3D LiDAR adds a vertical scan dimension to what 2D LiDAR provides, its data is commonly stored in voxel maps. Applications that demand large-range, high-precision positioning and involve complex motion, such as autonomous vehicles and robots, typically use 3D LiDAR.
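A voxel map can be thought of as the set of small cubes that contain at least one LiDAR return. The sketch below shows one simple way to build such a set; the voxel size, the sample points, and the sparse-set representation are illustrative assumptions, and production systems often use more elaborate structures such as octrees.

```python
import math

def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by at least one point."""
    occupied = set()
    for x, y, z in points:
        occupied.add((
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        ))
    return occupied

# Hypothetical returns in meters: the first two fall into the same 10 cm voxel.
points = [
    (0.05, 0.05, 0.05),
    (0.07, 0.02, 0.01),
    (1.04, 0.03, -0.05),
]
voxels = voxelize(points, voxel_size=0.1)
print(sorted(voxels))
```

Storing only occupied indices keeps memory proportional to the observed surfaces rather than to the full volume.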
Whether the LiDAR is 2D or 3D, unless it is a pure solid-state device that captures the whole scene in a single shot, motion distortion can occur: beams at different angles are not fired simultaneously, so the sensor may have moved between them. Methods to minimize this scanning distortion are an active area of development.
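One common idea for reducing motion distortion is to move every point back into the sensor frame at the start of the sweep, using an estimate of how the sensor moved while the beam was being fired. The sketch below assumes planar motion with constant linear velocity and yaw rate during one sweep, with the velocity expressed in the start-of-sweep frame; the motion estimate itself (for example from odometry or an IMU) is assumed to be given, and all names are hypothetical.

```python
import math

def deskew_scan(points, timestamps, velocity, yaw_rate):
    """Express each 2D point in the sensor frame at the start of the sweep.

    points: (x, y) in the sensor frame at the moment each beam was fired
    timestamps: seconds since the start of the sweep, one per point
    velocity: (vx, vy) in m/s, in the start-of-sweep frame (assumed constant)
    yaw_rate: rad/s (assumed constant)
    """
    vx, vy = velocity
    corrected = []
    for (x, y), t in zip(points, timestamps):
        theta = yaw_rate * t
        c, s = math.cos(theta), math.sin(theta)
        # Rotate by the yaw accumulated so far, then add the distance travelled.
        corrected.append((c * x - s * y + vx * t, s * x + c * y + vy * t))
    return corrected

# A wall point that is 2 m ahead at the start of the sweep. The sensor drives
# forward at 1 m/s, so 0.1 s later the same point is measured at 1.9 m.
measured = [(2.0, 0.0), (1.9, 0.0)]
stamps = [0.0, 0.1]
deskewed = deskew_scan(measured, stamps, velocity=(1.0, 0.0), yaw_rate=0.0)
print(deskewed)
```

After correction, both measurements of the wall point land at the same location, which is what a distortion-free scan would show.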
LiDAR SLAM vs Visual SLAM
LiDAR SLAM is the more mature technology and is currently used mainly in indoor environments. The maps it builds are more accurate, accumulate less error, and can be used directly for robot positioning and navigation. However, these maps lack semantic information, and establishing loop closures can be challenging.
Visual SLAM can operate both indoors and outdoors. System costs are lower, and semantic information can also be extracted from images. However, visual SLAM relies heavily on sufficient scene lighting and texture, failing to function properly in dark environments or texture-less areas. The constructed maps have lower accuracy with some cumulative drift, and cannot directly enable path planning and navigation without further processing.
In summary, LiDAR SLAM and visual SLAM each have distinct strengths and limitations. Sensor fusion approaches that combine LiDAR data and visual data are an active area of research aimed at harnessing the complementary advantages of each. The trend is toward hybrid SLAM algorithms that fuse information from both modalities.
3. What does the framework structure of classic SLAM algorithms look like?
The classic visual SLAM framework represents over a decade of research results. This framework itself and the algorithms contained therein are now largely standardized and already provided in many computer vision and robotics libraries. The algorithm consists of five parts:
Sensor Data Acquisition: The system ingests sensor data like camera images or robot odometry readings. Additional preprocessing helps clean and align the incoming data streams.
Front-end Visual Odometry (VO): By matching features between pairs of images, rough estimations of camera motion are calculated. This provides an initial guess for the position optimization done later.
Back-end Nonlinear Optimization: A nonlinear optimizer incorporates visual odometry along with loop closures and other measurements to estimate globally consistent trajectories and maps. This corrects the drift that accumulates from dead reckoning.
Loop Detection: By recognizing a previously visited location, loops are closed in the robot's path. This ties together distinct sections of the map and improves overall consistency. Detecting loops also enables relocalization in kidnapped robot scenarios.
Mapping: Finally, the system fuses optimized data into 3D spatial maps representing landmarks, surfaces, and objects in the environment. These maps continue to develop in detail over time.
The modular structure allows each component to leverage the latest techniques while benefiting from the surrounding pipeline. Together, they enable accurate and robust simultaneous localization and mapping across many domains.
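How the back end and loop detection work together can be shown with a deliberately tiny pose graph. In this sketch, a robot moves along a line (two steps out, two steps back), its odometry overestimates forward motion, and a loop closure states that the last pose coincides with the first. Because the problem is one-dimensional it is linear, so a single least-squares solve stands in for the nonlinear optimizer; the measurements, the equal weighting of all constraints, and the function names are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def optimize_pose_graph(num_poses, edges):
    """Least-squares poses for edges (i, j, measured x_j - x_i); pose 0 is fixed at 0."""
    n = num_poses - 1  # unknowns are poses 1..n
    h = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for i, j, z in edges:
        # Each edge contributes a residual x_j - x_i - z to the normal equations.
        terms = [(i, -1.0), (j, 1.0)]
        for a, sign_a in terms:
            if a == 0:
                continue  # the fixed pose is not an unknown
            g[a - 1] += sign_a * z
            for b, sign_b in terms:
                if b != 0:
                    h[a - 1][b - 1] += sign_a * sign_b
    return [0.0] + solve_linear(h, g)

# Odometry edges with a forward bias (true steps are +1, +1, -1, -1).
odometry = [(0, 1, 1.1), (1, 2, 1.1), (2, 3, -0.9), (3, 4, -0.9)]
loop_closure = [(0, 4, 0.0)]  # pose 4 recognized as the starting place

# Dead reckoning: simply accumulate the odometry.
dead_reckoning = [0.0]
for _, _, step in odometry:
    dead_reckoning.append(dead_reckoning[-1] + step)

optimized = optimize_pose_graph(5, odometry + loop_closure)
print(dead_reckoning[-1], optimized[-1])
```

With the loop closure included, the 0.4 m of accumulated drift at the final pose is spread across all constraints instead of piling up at the end of the trajectory.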
4. What are some typical SLAM application scenarios?
SLAM enables robots to comprehend and navigate environments on their own. Early home implementations were in robotic vacuums. Without spatial awareness, these devices would clumsily bump into obstacles. By mapping rooms and localizing against furniture and walls, they can plan cleaning routes that optimize coverage. Robots can now handle more complex spaces like warehouses and hospitals. As algorithms improve, the potential roles for robotics keep expanding.
High Precision Localization for Autonomous Driving
Self-driving cars rely on accurate localization to operate safely. SLAM can respond faster than GPS while providing centimeter-level precision. This helps vehicles stay in lanes, obey traffic signals, and avoid collisions. Tight urban spaces and intersections raise the stakes further. Fusing inputs from cameras, LiDAR, radar, and motion sensors produces the reliable positioning needed for higher speeds or operation without driver oversight. And in situations like tunnels, where satellite signals fade, SLAM can carry localization on its own. The technology remains essential for unlocking the next levels of automated driving.
Drones rely on accurate positioning and mapping to navigate environments. SLAM enables them to immediately adjust flight paths when encountering obstacles. By fusing inputs from cameras, LiDAR, radar, and other sensors, advanced autonomy is unlocked. Drones can safely operate in hazardous sites like disaster zones, underground tunnels, and archaeological excavations. The combination of SLAM with surveying payloads also facilitates rapid 3D modeling of structures and landscapes.
Indoor Scene 3D Reconstruction, Augmented Reality
Apple announced its Vision Pro mixed reality headset, signaling the start of the "spatial computing" era, in which indoor 3D reconstruction may be the next hot technology arena. Take a simple example: wearing a mixed reality headset while furniture shopping at home, you can place a virtual table and see realistic lighting and shadows from the objects around it. To realize such use cases, AR devices need self-localization and environmental awareness capabilities. Self-localization helps an AR device understand its position in space so that it can render virtual objects like the table in the right place. Environmental awareness provides the spatial positions and shapes of the physical objects around the virtual one, enabling realistic rendering and interaction between the real and virtual worlds.
As AR/VR/XR hardware becomes smaller and more mobile, SLAM techniques must advance further. Efficient algorithms will provide the spatial awareness and low latency needed for seamless mixed-reality experiences. The boundary between real and virtual continues blurring thanks to SLAM capabilities.
5. What are the present and future challenges for SLAM?
While SLAM has made great strides, open problems persist around complexity, sensors, and edge cases.
Today's algorithms struggle with challenging conditions such as lighting changes, fast motion, and sprawling spaces. Single-sensor setups also hit limitations, which calls for fusing cameras, LiDAR, radar, and more. The resulting growth in data volume strains processing. Tight synchronization and calibration between devices remain non-trivial, and overall system stability suffers from having more points of failure.
Looking forward, the critical areas for improvement are:
Multi-Sensor Fusion: Combining diverse data streams leads to alignment and consistency issues. Processing the deluge of raw sensor feeds also hampers real-time performance.
Multi-Primitive Features: Extracting geometric primitives like planes and lines, and then associating those across modalities, lacks robust solutions currently.
Multi-Map Fusion: Merging maps built from different sensors or at different times requires new association techniques to limit complexity and memory burdens.
Geometric Semantics: Labeling geometry with semantic classes takes substantial manual effort today. Tighter integration with learning algorithms is needed.
AI+SLAM: While machine learning shows promise for SLAM, annotated datasets remain time-consuming bottlenecks.
Choosing the Right Data
Consider environmental challenges (lighting, texture, weather): System robustness in challenging conditions can be evaluated with datasets such as TUM MonoVO, Complex Urban, UrbanLoco, and VIODE.
Consider different scenes: For multi-scene data, there are RobotCar and H3D for urban scenes, ICL and TUM-VIE for indoor scenes, and RUGD for forests.
Choose datasets with annotations: Annotated options include KITTI, TartanAir, RADIATE, VIODE, H3D, RUGD, DISCOMAN, IDDA, A*3D, Virtual KITTI 2, TUK Campus, and Cirrus.
Consider motion patterns: Based on the use case, choose the platform motion profile, such as robot, car, UAV, USV, handheld, or simulation.
Building Custom Datasets with Smart Data Annotation Tools
When no suitable public dataset exists, a custom one may be needed. Manual annotation demands extensive time and labor, which makes efficient tools necessary. BasicAI Cloud offers free intelligent data annotation for 3D point clouds, images, video, and sensor fusion data. Its model-powered toolkit provides automatic annotation, object tracking, and 2D/3D semantic segmentation. Multi-level labeling systems also help engineers quickly produce high-quality custom datasets.