At first glance, general large models are unpredictable. This seems at odds with the rigor required for autonomous driving.
Is there a way to apply large models to self-driving that produces better, more dependable solutions?
The answer is YES.
At the 2023 CVPR workshop on autonomous driving, both Tesla and Wayve discussed their latest explorations of large generative models – using them to generate continuous video driving scenarios. Both efforts are still experimental, but the concept of large models has in fact been making its way into autonomous driving for years.
Today, we'll talk about applying large models in autonomous driving: its history, current status, and trends. Let's see how large models are leading autonomous driving into the 3.0 era.
Transformers Usher in the Era of Large Models
Large models generally have billions or hundreds of billions of parameters. This vast parameter space enables handling more complex tasks. Large language models (LLMs) are a typical type, using massive datasets to identify, summarize, translate, predict, and generate content.
Large language models are, to a large extent, built on a class of deep learning architectures called Transformer networks. Transformers learn context and meaning by tracking relationships in sequences of data, and they sparked a new era for large language models.
The Transformer's biggest innovation was the attention mechanism, which greatly improved the model's ability to learn long-range dependencies. Previously, natural language processing relied on RNNs or CNNs to model semantics. However, RNNs gradually lose information from early in a sequence as it grows longer, and CNNs' local receptive fields limit their ability to capture global context. Both therefore struggled to fully learn dependencies between distant words.
The Transformer's attention mechanism broke the long-sequence limitations inherent to RNNs and CNNs, allowing language models to acquire rich linguistic knowledge by pretraining on large corpora. Its modular, scalable structure also makes it possible to expand expressiveness by stacking more layers, providing a path to enormous parameter counts. By solving earlier models' sequence problems and offering a highly scalable architecture, Transformers laid the groundwork for large models.
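The attention mechanism described above can be sketched in a few lines. This is a minimal, illustrative implementation of scaled dot-product self-attention (not any particular production model): every query token scores every key token, so dependencies are captured no matter how far apart two tokens sit in the sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key, so long-range dependencies
    are captured regardless of token distance."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted mix of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings, attending to themselves
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)
```

Note the contrast with an RNN: no information is passed step by step, so token 1 and token 4 interact directly through their attention score rather than through a chain of hidden states.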
Applications are already emerging across fields. Large language models like GPT and Claude can capture complex language semantics for human-level interaction. In healthcare, they help process medical images and reports to aid diagnosis. In autonomous driving, they have the potential to analyze multimodal sensor data, improving complex scene understanding and decision-making.
Before delving deeper, let’s briefly review autonomous driving algorithm development.
The Path of Autonomous Driving Algorithms: From CNNs to Occupancy Networks
Autonomous driving algorithm modules can be divided into perception, decision-making, and planning/control. In the R&D process, engineers devote most time and effort to improving perception algorithm accuracy. The perception module parses and understands the traffic environment around the vehicle, which is the basis and prerequisite for realizing autonomous driving. Perception accuracy directly affects and restricts overall safety and reliability.
The perception module mainly obtains input data through sensors like cameras, LiDARs, and radars, then uses deep learning algorithms to accurately parse road markings, vehicles, pedestrians, and other elements for subsequent processes.
Compared to perception, the decision-making and planning/control modules play narrower, more reactive roles. They generate driving strategies based on the environmental understanding supplied by perception, plan motion trajectories and speeds in real time, and convert them into control commands to drive the vehicle.
Autonomous driving algorithms have gone through multiple iterations.
Traditional Small Models: CNNs, RNNs and GANs
Convolutional neural networks (CNNs) sparked autonomous driving's first wave of innovation with their outstanding image recognition. Around 2011, CNNs demonstrated superior image processing, enabling accurate analysis of roads and objects. Fusing multisensor data effectively gave vehicles more comprehensive cognition, and gains in computing efficiency made complex real-time perception and decision-making feasible, greatly enhancing environmental awareness.
However, CNN-based autonomous driving requires a lot of labeled driving data for training, which is difficult to obtain in sufficient diversity. In addition, there is room for improving generalization and robustness in complex environments.
For sequential tasks, models like RNNs have clearer advantages. Their recurrent structure suits autonomous driving's time-series tasks, such as trajectory prediction and behavior analysis, predicting vehicles' future trajectories to support planning. GANs' generative capabilities, meanwhile, alleviate the shortage of training data: they can learn complex distributions and generate high-quality synthetic data, bringing new ideas to the field. Working together, with RNNs handling time-series modeling and GANs generating data, the two provided more comprehensive and reliable environmental awareness, state prediction, and decision support for autonomous driving systems, driving rapid development from 2016 to 2018.
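The trajectory-prediction role of an RNN can be sketched as follows. This is a structural illustration only: the weights are random and untrained, and the interface (a track of past (x, y) positions in, one predicted position out) is a simplified assumption, not any specific system's design.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 2, 16   # input: one (x, y) position per timestep

# Randomly initialized weights -- a real predictor would be trained on driving logs.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, input_dim))

def predict_next_position(track):
    """Roll an RNN cell over a past (x, y) track; read out a next position."""
    h = np.zeros(hidden_dim)
    for pos in track:                       # one recurrent step per observed timestep
        h = np.tanh(pos @ W_xh + h @ W_hh)  # hidden state accumulates history
    return h @ W_hy                         # linear readout to a 2D position

track = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3]])
pred = predict_next_position(track)
```

The loop also shows where the gradient problem mentioned below comes from: information from the first position must survive repeated matrix multiplications and tanh squashing to influence the output.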
However, RNNs can suffer from vanishing or exploding gradients on long time series, and the quality of GAN-generated data is hard to control. Large real-world scenario datasets are still needed to train and optimize these models, while autonomous driving systems demand efficiency, real-time performance, and strong generalization across complex traffic scenes.
Transformers + BEV
The core idea of Bird's Eye View (BEV) models is to project 3D environmental data around the vehicle (from LiDARs, cameras, etc.) onto a bird's eye view plane to generate a 2D map. The BEV map allows clearer observation of the position and relationship of elements like roads, vehicles, and signs, which is more conducive to path planning and decision-making. BEV models can unify multisensor inputs into a shared representation, providing more consistent and comprehensive environmental information.
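The core projection step can be illustrated with a minimal sketch: drop the height coordinate and bin LiDAR points into a top-down grid. The ranges, cell size, and count-based encoding here are illustrative choices, not a specific system's parameters; real BEV pipelines typically encode learned features per cell rather than raw counts.

```python
import numpy as np

def points_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), cell=0.5):
    """Discard height and bin (x, y) into a top-down occupancy grid."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((nx, ny), dtype=np.int32)
    for x, y, z in points:          # z is dropped: this is the lost 3D insight
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            i = int((x - x_range[0]) / cell)
            j = int((y - y_range[0]) / cell)
            grid[i, j] += 1         # count of LiDAR returns per cell
    return grid

pts = np.array([[10.0, 5.0, 1.2],    # two returns from the same object...
                [10.1, 5.2, 0.3],
                [-60.0, 0.0, 0.0]])  # ...and one outside the grid, discarded
bev = points_to_bev(pts)
```

The discarded `z` in the loop is exactly the limitation the next sections address: a tall truck and a low curb at the same (x, y) look identical in this map.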
However, BEV requires complex calibration and data processing and has high computing demands, and the projection discards some 3D information. After 2020, Transformer + BEV became a popular new solution.
Transformers have shown unique advantages in dealing with sequence data and complex contexts (validated in NLP). They also have potential in multisource heterogeneous data modeling. BEV can efficiently express abundant spatial information. Combined, they can achieve more accurate awareness, longer-range planning, and more globalized decision-making.
Yet BEV discards height information: it works well for lane-level structure but cannot represent 3D volume. In 2022, Tesla unveiled the Occupancy Network model, upgrading BEV.
Compared to BEV, Occupancy Networks can reconstruct the 3D environment more accurately, enhancing awareness. They contain an encoder that learns rich semantics, and a decoder that generates 3D representations. By learning to represent 3D surfaces as neural network decision boundaries without requiring LiDAR point clouds, they integrate perceived geometry and semantics to obtain accurate scene info. Occupancy network technology allows Tesla to make full use of unlabeled data to effectively supplement labeled datasets, which is important for improving safety and reducing accidents.
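The idea of representing a 3D surface as a network's decision boundary can be sketched with a tiny implicit occupancy function. This is a conceptual illustration with random, untrained weights, not Tesla's architecture: a small MLP maps any 3D query point to a probability of being occupied, and the 0.5 level set of that function is the learned surface.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.5, size=(3, 32))   # untrained weights, structure only
b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 1))

def occupancy(p):
    """Implicit occupancy: map 3D query points to P(occupied) in [0, 1].
    The 0.5 contour of this function is the (here random) 'surface'."""
    h = np.maximum(p @ W1 + b1, 0.0)          # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))    # sigmoid output

# Query a small grid of 3D points around the vehicle: 4 x 4 x 4 = 64 queries
xs = np.stack(np.meshgrid(*[np.linspace(-1, 1, 4)] * 3), -1).reshape(-1, 3)
probs = occupancy(xs)
```

Because occupancy is queried per point rather than stored per pixel, the representation keeps the height dimension a BEV map throws away, and it can be supervised from video without LiDAR labels.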
Autonomous Driving in the Large Model Era
Inspiration from GPT
GPT-like large models typically use Transformer structures for distributed training and have inspired autonomous driving researchers to use similar architectures for end-to-end learning, and even pre-trained models specifically designed for autonomous driving.
Pretraining gives these models powerful semantic representations that help them understand complex or abnormal scenes, boosting robust perception, and lets them generalize better to new environments. By finely modeling sequences, large models can infer causal relationships over time and better predict dynamic environment changes. They can fuse multimodal inputs (images, point clouds, etc.) to achieve comprehensive scene understanding, and they are capable of direct end-to-end learning from sensor inputs to driving outputs, simplifying architectures.
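One common way to realize the multimodal fusion mentioned above is to embed each modality into tokens of a shared width and let self-attention mix them. The sketch below is a minimal illustration under that assumption (random features, single attention layer, no learned projections), not a specific model: once camera patches and LiDAR voxels live in one token sequence, every camera token can attend to every LiDAR token and vice versa.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d = 16                                     # shared embedding width (assumed)
camera_tokens = rng.normal(size=(6, d))    # e.g. image patch embeddings
lidar_tokens = rng.normal(size=(4, d))     # e.g. pillar/voxel embeddings

# Concatenate modalities into one sequence; self-attention then lets each
# token draw context from both sensors at once.
tokens = np.concatenate([camera_tokens, lidar_tokens], axis=0)  # (10, d)
scores = tokens @ tokens.T / np.sqrt(d)
fused = softmax(scores) @ tokens           # jointly contextualized features
```

The same fused sequence can then feed detection, prediction, or even direct driving outputs, which is what makes end-to-end architectures with a single Transformer backbone attractive.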
Accelerating Autonomous Driving Model Maturity
Applying large models has set clearer expectations for L3/L4 deployment. Tesla's Transformer + BEV + Occupancy Network stack lets vehicles understand complex traffic more accurately, providing the stronger environmental awareness L3/L4 requires. Significant progress in perception, decision-making, and control has come from optimized algorithms such as Transformer + BEV and Occupancy Networks, which process data efficiently and accurately for environmental cognition. The resulting real-time planning and decision-making enables safe driving in complex conditions, laying the foundation for commercialization.
Applying Large Models to Autonomous Driving
Currently, large models are mainly applied to perception and prediction. Transformer models extract features from BEV data for obstacle monitoring and positioning. The prediction layer uses Transformers to capture traffic participant motion patterns and trajectories to predict future behaviors.
For planning and decision-making, autonomous driving still generally uses rule-based methods to generate driving strategies. With increasing autonomy and expanding applications, rule-based planning/control shows its limitations in corner cases. In the future, large models are expected to gradually shift strategy generation from rule-driven to data-driven. Combined with vehicle dynamics, Transformers can generate appropriate strategies: they integrate environmental, road-condition, vehicle-status, and other data, and multi-head attention balances these information sources to make quick, reasonable decisions in complex environments.
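The corner-case limitation of rule-based planning is easy to see in a toy example. The function and thresholds below are entirely hypothetical, made up for illustration: each new situation needs another hand-written rule, which is what motivates the shift toward data-driven strategy generation.

```python
def rule_based_strategy(dist_to_lead_m, lead_speed_mps, ego_speed_mps):
    """Toy rule-based planner with hypothetical thresholds. Hand-written
    rules cover common cases, but every corner case (cut-ins, debris,
    unusual geometry) demands yet another explicit rule."""
    time_gap = dist_to_lead_m / max(ego_speed_mps, 0.1)  # seconds to lead car
    if time_gap < 1.0:
        return "brake"
    if time_gap < 2.0 and lead_speed_mps < ego_speed_mps:
        return "decelerate"
    return "keep_speed"

action = rule_based_strategy(10.0, 5.0, 15.0)  # close, slower lead vehicle
```

A data-driven planner instead learns the mapping from fused scene features to strategy, so behavior in unanticipated situations is interpolated from data rather than enumerated by hand.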
Other Demand Changes Autonomous Driving Will Bring
The application of large models is still in the early stages, and there are still problems with multimodal fusion, large computing demands, onboard deployment, and safety/consistency for smart cars. With optimization and application, large models' role in autonomous driving may increase rapidly.
Increased Onboard Computing Requirements
Higher autonomous driving levels use more sensors and generate more data. With large models shifting scenarios toward urban roads, relying too much on cloud processing leads to reduced efficiency and delays, greatly affecting experience. Edge computing can pre-process and filter useless data before cloud upload.
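The pre-processing step can be as simple as change detection at the edge. The sketch below is a deliberately simplified illustration (frames as small value lists, a made-up mean-absolute-difference threshold), not a production pipeline: a frame is uploaded only when it differs enough from the last uploaded one.

```python
def mean_abs_diff(a, b):
    """Average absolute per-element difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def filter_frames(frames, threshold=0.2):
    """Toy edge-side filter: keep a frame for cloud upload only when it
    differs enough from the previously kept frame."""
    uploaded, last = [], None
    for frame in frames:
        if last is None or mean_abs_diff(frame, last) > threshold:
            uploaded.append(frame)
            last = frame
    return uploaded

# Two near-duplicate pairs: only one frame from each pair is uploaded
frames = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.02, 1.0]]
kept = filter_frames(frames)
```

Real systems filter on richer signals (detected events, scene novelty, disengagements), but the principle is the same: bandwidth and cloud compute are spent only on frames that carry new information.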
Increased 4D Sensor Demand
Currently, LiDAR remains an important supplementary sensor for smart cars thanks to its superior distance and spatial accuracy. Multi-sensor fusion solutions that include LiDAR achieve full perception through complementarity and provide higher-level redundancy, and growing scale will continue to push LiDAR costs down. In the short term, LiDAR is expected to remain important, but newer sensor types such as 4D imaging radar may also reshape smart cars' sensor suites.
4D radars add pitch-angle (elevation) perception to traditional radars, producing rich point clouds that capture targets' and the environment's distance, speed, and angle. They adapt to more complex conditions, such as small, occluded, stationary, or laterally moving objects. Some 4D radars approach 16-line LiDAR performance at roughly a tenth of the cost. Tesla first fitted 4D radar to Model S/X with HW 4.0, and 4D radars are expected to quickly penetrate mid-to-high-end and autonomous models.
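The four dimensions map naturally to a point-cloud representation. As a small illustrative sketch (a standard spherical-to-Cartesian conversion, not any vendor's processing chain), each return's range, azimuth, and elevation give a 3D position, with radial speed carried along as the fourth channel; the elevation term is exactly what traditional 3D radars lack.

```python
import math

def radar_to_cartesian(range_m, azimuth_rad, elevation_rad, doppler_mps):
    """Convert one 4D radar return (range, azimuth, elevation, radial speed)
    to a Cartesian point. Elevation (pitch) is the added 4th dimension."""
    x = range_m * math.cos(elevation_rad) * math.cos(azimuth_rad)  # forward
    y = range_m * math.cos(elevation_rad) * math.sin(azimuth_rad)  # lateral
    z = range_m * math.sin(elevation_rad)                          # height
    return (x, y, z, doppler_mps)

# A return 10 m away, straight ahead, 30 degrees up, closing at 2 m/s
pt = radar_to_cartesian(10.0, 0.0, math.pi / 6, -2.0)
```

With `z` available, an overhead sign and a stopped car at the same range are no longer ambiguous, which is why 4D radar handles stationary objects far better than its predecessors.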
Increased Real-World Scenario Data Demand
Large models promote rapid autonomous driving development, and evolving sensors further increase the demand for sensor data. Solving corner cases, however, needs real-world data. Companies use synthesized data to simulate the real world for training, but synthetic data cannot fully reproduce real-world complexity and variability: over-relying on it produces models that perform well in training yet struggle once deployed. Annotation difficulty is also increasing. Early systems may have only required vehicle identification, but today's technology needs traffic light recognition and more; data annotation will become more refined, and new sensors will raise the difficulty further. Choosing the right data partner will therefore become very important.
Finding the Right Data Partner for Autonomous Vehicle Models
Human-Machine Coupled Multi-Sensor Annotation Becomes Necessary
Driven by large models and declining sensor costs, autonomous driving is advancing comprehensively toward L3+. L3+ levels rely heavily on computer vision, needing real-time sensor data processing to construct the driving environment for prediction and decision-making, which poses challenges for algorithm accuracy and latency. Supervised learning remains the main approach, requiring large annotated datasets for training and optimization. According to a Dimensional Research global survey, 72% of respondents believe more than 100,000 pieces of training data are required to ensure model effectiveness and reliability, and 96% have encountered problems with training data quality, quantity, or annotation staffing when training models.
Given huge data needs, human-machine collaborative annotation will become the new paradigm, improving efficiency and saving costs. Meanwhile, L3+ autonomous driving requires substantial 3D point cloud data support, with high annotation difficulty, posing an even greater industry challenge.
Why Industry Leaders Choose BasicAI as an Autonomous Driving Data Partner
BasicAI provides professional datasets for AI algorithm development and training. Since its inception, BasicAI has built LiDAR annotation tools that combine machine learning with manual annotation, improving the efficiency of high-quality annotation 82-fold over purely manual work. Its proprietary BasicAI Cloud 3D point cloud tools have become industry benchmarks, supporting automatic annotation through 3D sensor fusion, semantic segmentation, 3D object tracking, and more. It also pioneered support for 4D BEV data annotation, providing powerful data for L3+ algorithms, and serves Fortune 500 companies, solving massive data issues and delivering quality training data services for cutting-edge models.