CVPR 2025: Conference Highlights
CVPR 2025 wrapped up in Nashville this June, delivering another impressive showcase of computer vision research.
The conference drew 9,375 attendees from 75 countries, and 2,878 accepted papers competed for their attention—just 22.1% of the 13,008 submissions made the cut.
Beyond the main conference, 118 workshops and 25 tutorials explored computer vision applications in autonomous driving, healthcare, and robotics. Research themes spanned image and video synthesis, multi-view 3D reconstruction, multimodal learning, face and pose analysis, low-level vision, and visual-language reasoning.
Two papers claimed the top honors: "VGGT: Visual Geometry Grounded Transformer" won Best Paper, while "Neural Inverse Rendering from Propagating Light" earned Best Student Paper. The conference also featured an AI art program showcasing 102 works exploring the intersection of science and art.
In this post, let's examine these standout works and their implications for future research and applications.
CVPR 2025 Best Paper: VGGT: Visual Geometry Grounded Transformer
A collaboration between Oxford's Visual Geometry Group and Meta AI produced this year's sole Best Paper winner, representing a breakthrough in 3D computer vision and multi-view geometry.

The Problem
Most 3D scene reconstruction systems still rely on Bundle Adjustment (BA) and other iterative optimization techniques that work but are computationally expensive. While machine learning has improved pieces of the puzzle — like feature matching — the core pipeline hasn't changed much. This creates a bottleneck for real-time applications.
As neural networks grow more powerful, can we solve 3D tasks directly with neural networks while nearly eliminating geometric post-processing?
The Solution
VGGT (Visual Geometry Grounded Transformer) takes a different approach. Instead of treating 3D reconstruction as an optimization problem, it treats it as a prediction problem. Feed the model anywhere from one to hundreds of images, and it directly outputs camera parameters, depth maps, point correspondences, and trajectories.
The architecture uses alternating attention mechanisms that switch between analyzing individual frames and integrating information across all images. This design choice helps balance local detail with global consistency. The model also predicts multiple related 3D properties simultaneously, which improves accuracy even when some predictions are redundant.
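To make the alternating-attention idea concrete, here is a minimal PyTorch sketch of a block that first lets each frame's tokens attend to each other and then lets all tokens attend globally. The tensor layout, module names, and single-block structure are our own illustrative assumptions, not the official VGGT code.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative sketch: attend within each frame, then across all frames.

    Assumes tokens are shaped (batch, num_frames, tokens_per_frame, dim);
    this is NOT the official VGGT implementation.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, t, d = x.shape

        # Frame-wise attention: each frame's tokens only see that frame.
        h = x.reshape(b * f, t, d)
        h_norm = self.norm1(h)
        h = h + self.frame_attn(h_norm, h_norm, h_norm)[0]

        # Global attention: all tokens from all frames attend to each other.
        g = h.reshape(b, f * t, d)
        g_norm = self.norm2(g)
        g = g + self.global_attn(g_norm, g_norm, g_norm)[0]
        return g.reshape(b, f, t, d)

# Example: 2 scenes, 4 input images, 196 patch tokens each, 256-dim features.
tokens = torch.randn(2, 4, 196, 256)
block = AlternatingAttentionBlock(dim=256)
print(block(tokens).shape)  # torch.Size([2, 4, 196, 256])
```

Stacking such blocks is what lets the model keep per-frame detail while still enforcing cross-view consistency, without any iterative geometric optimization at inference time.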
Why It Matters
The results speak for themselves. On RealEstate10K, VGGT achieves 85.3 AUC@30 in 0.2 seconds, while comparable methods like DUSt3R take 7-10 seconds. For point cloud estimation on ETH3D, it reduces Chamfer distance to 0.677 while running 45 times faster.
VGGT represents a shift toward treating 3D vision as a learning problem rather than an optimization problem. That opens doors for AR applications, robotic navigation, and autonomous driving systems that need real-time 3D understanding.
CVPR 2025 Best Student Paper: Neural Inverse Rendering from Propagating Light
This Best Student Paper winner comes from researchers at the University of Toronto, the Vector Institute, and Carnegie Mellon University, who propose a physics-based neural inverse rendering method.

The Problem
LiDAR systems typically focus on direct light reflections—light that bounces once off a surface and returns to the sensor.
But light also bounces multiple times before returning, carrying information about materials, geometry, and scene structure that current systems throw away.
The challenge is that this indirect light is much harder to interpret. Traditional methods treat it as noise, but it actually contains rich information if you know how to decode it.
The Solution
The team developed a time-resolved radiance cache—essentially a neural network that learns to store and query information about how light propagates through a scene over time.
Simply put, it enables a LiDAR system not only to see direct light but also to interpret indirect light, and to use that information for scene reconstruction.
The method works in two steps.
First, it builds a time-resolved radiance cache of light transport that tracks how illumination propagates through the scene.
Second, it uses neural networks to make queries to this cache efficient enough for practical use.
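As a rough picture of what "time-resolved" means in practice, the toy sketch below maps a 3D point, a viewing direction, and a time-of-flight bin to a radiance value with a small MLP. The inputs, architecture, and naming are our simplifications for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimeResolvedRadianceCache(nn.Module):
    """Toy sketch of a time-resolved radiance cache.

    Maps (position, view direction, time-of-flight bin) -> radiance,
    so queries about light arriving at different delays can be answered
    without re-simulating transport. Illustrative assumptions only.
    """
    def __init__(self, hidden: int = 128):
        super().__init__()
        # 3 (xyz) + 3 (view dir) + 1 (normalized time bin) = 7 inputs
        self.mlp = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB radiance at that delay
        )

    def forward(self, xyz, view_dir, t_bin):
        features = torch.cat([xyz, view_dir, t_bin], dim=-1)
        return self.mlp(features)

cache = TimeResolvedRadianceCache()
xyz = torch.rand(1024, 3)          # query points in the scene
view_dir = torch.randn(1024, 3)
view_dir = view_dir / view_dir.norm(dim=-1, keepdim=True)
t_bin = torch.rand(1024, 1)        # normalized time of flight
radiance = cache(xyz, view_dir, t_bin)
print(radiance.shape)              # torch.Size([1024, 3])
```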
Beyond better 3D reconstruction in challenging lighting conditions, the system enables some novel capabilities. It can synthesize videos showing how light propagates from new viewpoints, automatically separate direct and indirect illumination, and even relight captured scenes with new light sources.
Why It Matters
This research could enhance autonomous driving perception in complex lighting conditions, better handling specular reflections and multiple scattering effects. For fundamental research, it demonstrates how combining physics with neural networks can unlock capabilities that neither approach achieves alone.
CVPR 2025 Best Paper Honorable Mentions
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
Google Research's Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski tackle a fundamental challenge in computer vision.

The Problem
Traditional Structure-from-Motion or SLAM techniques struggle with casually captured dynamic videos. These videos typically feature limited camera motion, uncertain fields of view, and complex scene dynamics, causing traditional methods to frequently fail.
The Solution
MegaSaM rebuilds the visual SLAM pipeline from the ground up for dynamic scenes. The system integrates monocular depth priors with motion probability graphs in a differentiable framework. The key insight is using uncertainty-aware global bundle adjustment that can handle the ambiguities inherent in casual video capture.
Rather than requiring static scenes or significant camera motion, MegaSaM works with minimal parallax and complex dynamics. The system jointly estimates camera trajectories and scene depth while accounting for the uncertainty in both measurements and motion patterns.
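One way to picture "uncertainty-aware" optimization is a heteroscedastic-style loss in which each residual is down-weighted by its predicted uncertainty, while the uncertainty term itself is penalized so the model cannot simply declare everything uncertain. The sketch below uses that generic formulation on a made-up depth-offset example; it is not MegaSaM's actual objective or code.

```python
import torch

def uncertainty_weighted_loss(residuals, log_sigma):
    """Heteroscedastic-style objective: confident residuals count more.

    Generic formulation used to illustrate uncertainty-aware bundle
    adjustment; MegaSaM's exact objective differs.
    """
    return (residuals ** 2 * torch.exp(-2.0 * log_sigma) + 2.0 * log_sigma).mean()

# Toy example: refine a single shared depth offset under fixed per-pixel
# uncertainties (in the real system both come from the network).
torch.manual_seed(0)
residuals = torch.randn(1000) * 0.1 + 0.5   # measurements with a shared bias
log_sigma = torch.full((1000,), -2.3)        # sigma ~= 0.1 per residual
offset = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([offset], lr=5e-2)
for _ in range(200):
    opt.zero_grad()
    loss = uncertainty_weighted_loss(residuals - offset, log_sigma)
    loss.backward()
    opt.step()
print(float(offset))  # close to 0.5: the estimated offset absorbs the bias
```

Because every term stays differentiable, camera poses, depth, and uncertainties can all be refined jointly with gradient-based optimization instead of hand-engineered alternation.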
Why It Matters
MegaSaM significantly outperforms existing methods on both synthetic and real datasets while maintaining efficient performance.
AR/VR applications can now work with arbitrary user-generated content rather than requiring specialized capture procedures.
The broader impact extends to autonomous systems that need to understand dynamic environments and robotics applications where controlled data capture isn't feasible.
Navigation World Models
This collaboration between FAIR at Meta, NYU, and Berkeley AI Research, led by Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun, rethinks how robots navigate by teaching them to imagine.

The Problem
Current robotic navigation systems learn fixed behaviors during training. If you train a robot to navigate hallways, it struggles when you later decide it should avoid certain areas or follow new constraints.
These systems lack the flexibility humans have when navigating—we can easily incorporate new rules like "avoid the construction zone" or "stay on the right side."
The Solution
Navigation World Models (NWM) learns to predict what a robot would see as it moves through an environment. This predictive capability allows planning by simulation—the system can mentally explore different paths and evaluate which ones successfully reach the goal while satisfying constraints.
NWM uses a conditional diffusion transformer (CDiT) architecture trained on diverse first-person videos from both humans and robots, scaling to 1 billion parameters. The model learns rich visual representations that generalize across environments and tasks.
In familiar environments, NWM plans complete trajectories by simulating forward and evaluating success.
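Planning by simulation boils down to: sample candidate action sequences, roll each one through the world model, and keep the plan whose imagined outcome scores best against the goal. The sketch below shows only that control flow; `world_model` and `goal_score` are hypothetical stand-in interfaces, not the NWM API.

```python
import torch

def plan_by_simulation(world_model, goal_score, current_obs,
                       num_candidates=64, horizon=8, action_dim=2):
    """Sketch of sampling-based planning with a learned world model.

    `world_model(obs, action)` -> predicted next observation, and
    `goal_score(obs)` -> scalar goal-reaching score, are assumed
    interfaces for illustration only.
    """
    best_score, best_actions = -float("inf"), None
    for _ in range(num_candidates):
        actions = torch.randn(horizon, action_dim)   # random candidate plan
        obs = current_obs
        for a in actions:                            # imagine the rollout
            obs = world_model(obs, a)
        score = goal_score(obs)                      # evaluate the final state
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

# Toy stand-ins so the sketch runs end to end.
world_model = lambda obs, a: obs + torch.cat(
    [a, torch.zeros(obs.shape[-1] - a.shape[-1])])
goal = torch.tensor([3.0, 3.0, 0.0, 0.0])
goal_score = lambda obs: -torch.norm(obs - goal).item()
plan, score = plan_by_simulation(world_model, goal_score, torch.zeros(4))
print(score)  # the best random plan ends closest to the goal
```

Swapping the random candidate sampler for a smarter optimizer, or adding penalty terms to the score for constraints like "avoid this area," changes the behavior without retraining the model.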
Why It Matters
Unlike fixed navigation policies, NWM can dynamically incorporate new constraints during planning. Perhaps most impressively, the system can imagine possible navigation paths in completely novel environments using only a single input image.
This represents a shift from reactive navigation systems toward predictive ones that can reason about future states and adapt to changing requirements.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
This research introduced the Molmo family of vision-language models, among the strongest open-source models at the time*. Its flagship 72-billion-parameter model achieved open-source SOTA and surpassed Claude 3.5 Sonnet and Gemini 1.5 Pro on both academic benchmarks and human evaluation.
*Note: first version published September 2024

Background
The best vision-language models (like GPT-4o, Gemini 1.5 Pro) remain proprietary, with undisclosed weights and training data. Existing open-source models either rely on closed-source data or heavily depend on synthetic data generated from proprietary VLMs, essentially distilling these closed-source models.
This leaves academia lacking fundamental knowledge about building high-performance VLMs from scratch.
What It Does
The team launched Molmo (Multimodal Open Language Model), a family of fully open-source VLMs, along with the corresponding PixMo dataset.
Molmo uses a standard vision encoder (ViT) + language model architecture. For model design and optimization, Molmo proposed several new strategies: overlapping multi-crop image processing, improved vision-language connection modules, and training pipelines supporting pointing capabilities. These innovations improved model performance on complex visual tasks like localization, counting, and natural image understanding.
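At a shape level, the encoder-connector-language-model layout looks roughly like the sketch below. Every module here is a tiny stand-in, and the overlapping multi-crop processing and pointing heads are omitted, so treat it as an illustration of the data flow rather than Molmo's actual architecture.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Shape-level sketch of a ViT + connector + language model VLM.

    All components are placeholders; Molmo's real encoder, connector, and
    (causal, decoder-only) LLM are far larger and include multi-crop and
    pointing support.
    """
    def __init__(self, vision_dim=768, llm_dim=1024, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.connector = nn.Linear(vision_dim, llm_dim)  # vision -> LLM space
        self.llm = nn.TransformerEncoder(                # stand-in for the LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_patches, text_ids):
        vis = self.connector(self.vision_encoder(image_patches))
        txt = self.text_embed(text_ids)
        seq = torch.cat([vis, txt], dim=1)   # prepend image tokens to the text
        return self.lm_head(self.llm(seq))

model = TinyVLM()
logits = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 212, 32000])
```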
The PixMo dataset includes detailed image descriptions for pre-training, free-form Q&A data for fine-tuning, and innovative 2D pointing datasets, all independent of external VLM generation.
Why It Matters
By providing complete transparency—architecture details, training procedures, and clean training data—this work enables researchers to understand what actually makes vision-language models work.
This level of openness accelerates research by giving everyone the same foundation to build upon, rather than forcing researchers to reverse-engineer or depend on black-box systems.
3D Student Splatting and Scooping
Researchers from UCL and University of Leeds made a fundamental improvement to 3D Gaussian Splatting, one of the most influential recent techniques in neural rendering.

The Problem
3D Gaussian Splatting (3DGS) has generated tremendous interest in novel view synthesis, quickly becoming a foundational component in numerous 3D reconstruction and neural rendering systems.
However, the technique has two limitations: Gaussian distributions may not be the optimal choice for all scene content, and the method can only add density, never subtract it.
The Solution
Student Splatting and Scooping (SSS) replaces Gaussian distributions with Student's t-distributions, which include Gaussians as a special case but can represent a much wider range of shapes by learning the tail thickness parameter. This flexibility allows the model to adapt the distribution shape to match the actual scene content.
The "scooping" component introduces negative density, enabling the model to perform subtraction operations in density space. This is conceptually similar to constructive solid geometry — you can now carve out spaces rather than just fill them.
The team also addresses the technical challenges these improvements create, particularly parameter coupling issues, by incorporating Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) sampling methods.
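A one-dimensional toy version of "splatting and scooping" can be written as a mixture of Student's t kernels with signed weights: positive weights add density, negative weights carve it away, and the degrees-of-freedom parameter controls tail thickness. The sketch below illustrates only this idea, not the paper's 3D renderer or its compositing rules.

```python
import torch
from torch.distributions import StudentT

def signed_t_mixture_density(x, locs, scales, dfs, signed_weights):
    """Toy 1D 'splatting and scooping' with Student's t kernels.

    Each component has a tail parameter (degrees of freedom) and a signed
    weight: positive weights add density, negative weights scoop it out.
    Illustration only, not the paper's formulation.
    """
    density = torch.zeros_like(x)
    for loc, scale, df, w in zip(locs, scales, dfs, signed_weights):
        density = density + w * StudentT(df, loc, scale).log_prob(x).exp()
    # One simple way to keep the toy density non-negative after scooping.
    return density.clamp_min(0.0)

x = torch.linspace(-4, 4, 9)
density = signed_t_mixture_density(
    x,
    locs=torch.tensor([0.0, 0.0]),
    scales=torch.tensor([1.0, 0.3]),
    dfs=torch.tensor([3.0, 10.0]),            # heavier vs. lighter tails
    signed_weights=torch.tensor([1.0, -0.5]), # second component "scoops"
)
print(density)
```

Here the narrow negative component hollows out the broad positive one near the origin, which is exactly the kind of shape a purely additive Gaussian mixture cannot represent compactly.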
Why It Matters
SSS achieves the same or better rendering quality while using up to 82% fewer components than standard Gaussian splatting.
This dramatic efficiency improvement has immediate practical benefits for VR/AR applications where computational resources are constrained.
The work also opens new research directions by demonstrating that careful consideration of the underlying mathematical primitives can yield significant improvements in neural rendering systems.
Best Student Paper Honorable Mention
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

The Problem
Vision-language models face competing objectives. For understanding tasks, you want to compress visual information into abstract representations that capture meaning while discarding irrelevant details. For generation tasks, you need to preserve enough detail to reconstruct high-quality images.
Current approaches typically use spatial visual tokens—essentially treating images as sequences of patches arranged in spatial order. However, research shows these spatial sequences lack the recursive structure that makes natural language easy for language models to process. This creates an "impossible language" that LLMs struggle to learn effectively.
The Solution
The team proposed Discrete Diffusion Timestep (DDT) tokenization, solving this problem by learning visual tokens with recursive structure.
Instead of spatial arrangement, these tokens are organized around the diffusion process—each token represents information that would be lost at a particular noise level during image generation.
This creates a natural hierarchy where tokens recursively build upon each other, similar to how language works. The approach leads to DDT-LLaMA, a multimodal model that effectively combines autoregressive language modeling with high-quality image generation.
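A crude way to picture timestep-ordered tokens: walk the diffusion schedule from noisy to clean and, at each level, encode the detail newly revealed at that level as a discrete code. The sketch below does this with a random codebook and nearest-neighbor lookup purely for illustration; the real DDT tokenizer is learned jointly with a diffusion decoder.

```python
import torch

def timestep_tokens(x0, codebook, alphas_bar):
    """Toy illustration of timestep-ordered tokenization.

    For each noise level we encode the 'detail' added when moving to the
    next (less noisy) level as the index of the nearest codebook vector.
    Purely illustrative; not the learned DDT tokenizer.
    """
    tokens = []
    prev = torch.zeros_like(x0)
    for a_bar in alphas_bar:                  # coarse (noisy) -> fine (clean)
        level = a_bar.sqrt() * x0             # deterministic "mean" at this level
        detail = level - prev                 # information revealed at this step
        idx = torch.cdist(detail.unsqueeze(0), codebook).argmin()
        tokens.append(int(idx))
        prev = level
    return tokens

x0 = torch.randn(16)                          # a tiny "image" feature vector
codebook = torch.randn(256, 16)               # 256 code vectors
alphas_bar = torch.linspace(0.1, 1.0, 8)      # 8 noise levels, coarse to fine
print(timestep_tokens(x0, codebook, alphas_bar))  # 8 coarse-to-fine tokens
```

Because each token only refines what earlier tokens established, the sequence has the recursive, coarse-to-fine structure that autoregressive language models handle well.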
Why It Matters
DDT represents a new paradigm for visual tokenization that aligns with how language models naturally process information. This could influence how future multimodal models are designed, potentially resolving the tension between understanding and generation tasks.
Other Notable Works
Beyond the award winners, the conference featured plenty of other noteworthy research. Here we highlight three additional works.

Point Clouds and Object Detection: ViiNeuS: Volumetric Initialization for Implicit Neural Surface Reconstruction of Urban Scenes with Limited Image Overlap
Neural implicit surface reconstruction works well in controlled settings but struggles with urban driving scenarios.
These environments are vast, geometrically complex, and captured with limited viewpoint overlap. Traditional approaches require additional LiDAR data, strong geometric assumptions, or extensive training time—all barriers to practical deployment.
Djeghim et al. proposed ViiNeuS, a hybrid architecture that simultaneously models volumetric density and signed distance fields.
The key innovation is a self-supervised probability density estimation approach that smoothly transitions from volumetric to surface representation by intelligently sampling points near surfaces.
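The volumetric-to-surface transition can be pictured with the Laplace-style mapping from signed distance to density used by several SDF-based volume renderers (e.g. VolSDF): a large smoothing parameter spreads density out volumetrically, while a small one concentrates it at the zero-level surface. The snippet below illustrates that general mechanism; ViiNeuS's actual hybrid field and self-supervised transition differ in the details.

```python
import torch

def sdf_to_density(sdf, beta):
    """Laplace-CDF mapping from signed distance to volumetric density.

    Small beta -> density concentrates at the surface (surface-like);
    large beta -> density spreads out (volumetric-like). Annealing beta
    during training is one way to realize a smooth volume-to-surface
    transition; this illustrates the concept, not ViiNeuS code.
    """
    alpha = 1.0 / beta
    return alpha * torch.where(
        sdf <= 0,
        1 - 0.5 * torch.exp(sdf / beta),   # inside the surface
        0.5 * torch.exp(-sdf / beta),      # outside the surface
    )

sdf = torch.linspace(-0.5, 0.5, 5)
print(sdf_to_density(sdf, beta=0.5))    # soft, volumetric early in training
print(sdf_to_density(sdf, beta=0.05))   # sharp, surface-like later on
```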
Testing on KITTI-360, Pandaset, Waymo, and nuScenes datasets shows ViiNeuS produces more accurate surface reconstructions while training twice as fast as previous methods like StreetSurf. The system represents entire complex urban scenes with a single hybrid implicit field.
This advancement matters for autonomous driving simulation, urban planning applications, and any system that needs detailed 3D models of real-world environments.
Robotics: RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Existing multimodal language models excel at understanding human instructions and visual scenes but struggle with robotic manipulation tasks.
They lack three critical capabilities: decomposing complex instructions into executable subtasks, understanding which objects can be manipulated and how, and predicting complete manipulation trajectories.
The team created ShareRobot, a heterogeneous dataset with annotations spanning task planning, object affordances, and end-effector trajectories.
RoboBrain builds on the LLaVA architecture with careful attention to the balance between robotic training data and general multimodal data.
The training strategy incorporates multi-stage learning and processes both long videos and high-resolution images to develop comprehensive manipulation understanding.
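Balancing robotic data against general multimodal data is essentially a staged data-mixing schedule. The sketch below shows one hypothetical way to express such a schedule; the stage names, mixing ratios, and dataset labels are invented for illustration and are not taken from the paper.

```python
import random

# Hypothetical staged data-mixing schedule, in the spirit of balancing
# general multimodal data against robot manipulation data across stages.
STAGES = {
    "stage1_general_pretrain": {"general_vqa": 1.0, "robot": 0.0},
    "stage2_mixed":            {"general_vqa": 0.7, "robot": 0.3},
    "stage3_robot_finetune":   {"general_vqa": 0.3, "robot": 0.7},
}

def sample_batch_sources(stage: str, batch_size: int = 8):
    """Pick which dataset each example in a training batch comes from."""
    ratios = STAGES[stage]
    sources, weights = zip(*ratios.items())
    return random.choices(sources, weights=weights, k=batch_size)

for stage in STAGES:
    print(stage, sample_batch_sources(stage))
```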
RoboBrain demonstrates how to build manipulation capabilities into language models rather than treating manipulation as a separate problem. This integration enables robots that can understand natural language instructions and translate them into physical actions more effectively.
Autonomous Driving: JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
As we noted in a previous blog post, autonomous driving perception technology heavily depends on large amounts of annotated LiDAR point cloud data, but this 3D annotation process is extremely time-consuming and labor-intensive.
Meanwhile, real-world datasets inevitably miss rare but important scenarios—corner cases that could cause system failures.
While simulators like CARLA can generate unlimited annotated data, including corner cases, bridging the gap between simulated and real-world performance remains challenging.
The research team proposed JiSAM (Jittering augmentation, domain-aware backbone, and memory-based Sector Alignment Module) to address simulation data sample efficiency and simulation-to-real domain gap issues.
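As a flavor of what the jittering augmentation component does, the snippet below adds small clipped Gaussian noise to clean simulator points so they look more like noisy real LiDAR returns. The function and its parameters are a generic illustration; the paper's exact augmentation may differ.

```python
import torch

def jitter_point_cloud(points, sigma=0.02, clip=0.05):
    """Add small clipped Gaussian noise to simulated LiDAR points.

    A generic jittering augmentation in the spirit of JiSAM's jittering
    component: clean simulator points are perturbed so they resemble
    noisy real-sensor returns. Parameters are illustrative.
    """
    noise = torch.clamp(sigma * torch.randn_like(points), -clip, clip)
    return points + noise

sim_points = torch.rand(2048, 3) * 50.0   # toy simulated LiDAR sweep (x, y, z)
aug_points = jitter_point_cloud(sim_points)
print((aug_points - sim_points).abs().max())  # noise stays within the clip bound
```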
Using only 2.5% of real annotated data plus simulation data, JiSAM matches the performance of models trained on complete real datasets. More impressively, it achieves over 15 mAP improvement on objects not labeled in the real training set, directly addressing the corner case problem.
This work provides a practical path toward reducing annotation costs while improving coverage of rare but critical scenarios.
What These Papers Indicate...
Examining CVPR 2025's outstanding work reveals several significant trends reshaping computer vision and AI.
Physics-Informed Learning: The most impactful work combines neural networks with physical or geometric principles rather than treating them as competing approaches. VGGT merges Transformers with multi-view geometry. The inverse rendering work integrates light transport physics with neural representations. This trend suggests the field is moving beyond "pure learning" toward more principled hybrid approaches.
Multimodal Integration: Vision-language models achieved new levels of capability and transparency. Molmo demonstrates that open models can compete with proprietary systems when built thoughtfully. The DDT work shows how to better align visual and linguistic representations. These advances indicate that multimodal AI is reaching a level of maturity where architectural innovations matter more than scale alone.
Real-World Deployment: There's a clear shift from benchmark optimization toward handling messy real-world conditions. Navigation World Models adapts to dynamic constraints. ViiNeuS works with limited-overlap urban imagery. This reflects growing confidence in moving from research demonstrations to practical applications.
CVPR 2026 (Jun 6-12, 2026) in Denver should reveal how these directions evolve. The combination of physics-informed architectures, efficient training methods, and transparent development practices points toward a field ready to move beyond the laboratory into widespread practical deployment.
The BasicAI team congratulates all award-winning teams and researchers!
