
In 2018, the Facebook AI Research (FAIR) team identified a long-standing fragmentation in computer vision:
Semantic segmentation ignored individual objects within the same category, while instance segmentation couldn't handle amorphous background regions like grass or walls.
This split forced applications like autonomous driving to awkwardly merge two separate outputs, often creating boundary overlaps and semantic conflicts.
Alexander Kirillov, Kaiming He, and their colleagues addressed this in their paper "Panoptic Segmentation." They defined a unified task and introduced the Panoptic Quality (PQ) metric to measure performance across both tasks.
This marked the beginning of panoptic segmentation.
In this blog post, we'll dive deep into this important task that's shaping computer vision.
Understanding Things and Stuff: The Foundation of Segmentation
The distinction between "Things" and "Stuff" forms the conceptual backbone of modern image segmentation. This framework recognizes that real-world scenes contain two fundamentally different types of elements.
Semantically, "Things" are countable objects—cars, people, bicycles. They are discrete, separable entities in human cognition. These objects require individual "instance IDs"—two cars in an image become "Car A" and "Car B."
"Stuff" is amorphous material—sky, road, grass. It spreads across regions without distinct boundaries. These regions only need semantic category labels (like "sky"), not instance distinctions—all "sky" pixels share the same label.
This distinction reflects how we naturally parse the world. When you look at a street scene, you instinctively separate the individual cars (Things) from the continuous asphalt (Stuff).
Early computer vision focused primarily on detecting Things. In 2001, Edward Adelson argued we were missing half the picture by ignoring Stuff.
His insight reveals machine vision's leap from "perceiving pixels" to "understanding scenes." The Things-Stuff distinction continues to shape visual understanding today.
Three Image Segmentation Tasks
The Things-Stuff divide led directly to the separation of traditional segmentation tasks—and ultimately to the emergence of panoptic segmentation.
Semantic Segmentation
Semantic segmentation handles Stuff-dominated scenes (terrain classification, medical tissue analysis). It assigns semantic labels to each pixel ("road," "sky") but ignores individual differences within the same category—all vehicles become "car," not "car #1" or "car #2."
This works well for applications that care about regions rather than individuals, like drivable area identification or segmenting satellite imagery.
Instance Segmentation
Instance segmentation focuses on Things-dominated scenes (robotic grasping, crowd counting). It detects bounding boxes and generates pixel-level masks to distinguish individuals, but completely ignores backgrounds and amorphous regions—missing the semantics of roads or sky.

What is Panoptic Segmentation?
Panoptic segmentation bridges this gap by unifying semantic and instance segmentation.
For countable objects (Things like vehicles, pedestrians), it assigns both semantic categories and unique instance IDs to distinguish individuals. For uncountable regions (Stuff like sky, grass), it only needs semantic labels without instance IDs.
The core goal: complete scene parsing—assigning both semantic labels and instance IDs to every pixel.
💡 Semantic labels indicate object categories; instance IDs distinguish different objects within the same category.
Take autonomous driving: vehicles (Things) need unique IDs for collision avoidance, while road surfaces (Stuff) only need semantic labels for path planning.
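To make this concrete, here is a toy sketch of what a panoptic output boils down to: two aligned per-pixel maps, one holding the semantic category and one holding the instance ID (the category ids below are hypothetical).

```python
import numpy as np

# Toy 2x4 panoptic result. Hypothetical category ids:
# 0 = sky (Stuff), 1 = road (Stuff), 2 = car (Thing).
category = np.array([[0, 0, 2, 2],
                     [1, 1, 2, 2]])

# Stuff pixels share instance id 0; the two cars get ids 1 and 2,
# so "car #1" and "car #2" can be distinguished and tracked separately.
instance = np.array([[0, 0, 1, 1],
                     [0, 0, 2, 2]])
```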
Panoptic vs. Semantic Segmentation
Semantic segmentation is classification-oriented, while panoptic segmentation is scene-understanding oriented: the former outputs a pixel-wise category map, while the latter additionally separates individual Thing instances, producing a complete scene parse.
Panoptic vs. Instance Segmentation
Instance segmentation allows overlapping masks between instances (overlapping vehicles). Panoptic segmentation enforces pixel-level non-overlap—each pixel belongs to exactly one instance or semantic region.
Why Panoptic Segmentation Matters
The separation of semantic and instance segmentation created algorithmic redundancy (two sets of model outputs) and pixel-level label conflicts. Panoptic segmentation's core contribution is folding the Things/Stuff division into a single, unified task.
Its unified output format and non-overlap rule enable AI applications to understand the relationships and functions of all scene elements.
Autonomous vehicles must simultaneously perceive movable objects (Things with instance IDs) to calculate collision risks and recognize the static environment (Stuff with semantic labels) for path planning.
Similarly, robotic navigation requires instance IDs to locate interactable objects (grasping targets) and Stuff semantics to build navigable area maps (grass vs. walls).
In medical imaging, panoptic segmentation analyzes both tissue pathology regions (Stuff) and lesion cell instances (Things), improving diagnostic efficiency.
These applications stem from the need for complete scene understanding. Panoptic segmentation isn't simply semantic plus instance segmentation—it catalyzes machine vision's evolution from perceiving pixels to understanding the world. Its mission of "comprehensive observation" makes it irreplaceable in core domains and essential for embodied AI.
Panoptic Segmentation Methods
Two-Stage Methods
Early panoptic segmentation methods adopted two-stage strategies: separately processing semantic and instance segmentation, then fusing results.
Panoptic FPN exemplifies this approach. Built on Mask R-CNN, it adds a semantic segmentation branch, using a shared feature pyramid network for simultaneous instance and semantic segmentation.

Post-processing must resolve conflicts between branch predictions, typically using heuristic rules like prioritizing instance segmentation results (usually more accurate).
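As a rough illustration of such heuristic fusion (a minimal sketch under simplified assumptions, not Panoptic FPN's exact procedure; input shapes are hypothetical), the merge can be as simple as painting instances in confidence order and filling the leftover pixels with stuff labels:

```python
import numpy as np

def heuristic_panoptic_merge(instance_masks, instance_classes, instance_scores,
                             semantic_map, stuff_ids, keep_frac=0.5):
    """Fuse instance and semantic predictions into one non-overlapping panoptic map.

    Hypothetical inputs: boolean masks (N, H, W), per-instance class ids and
    confidence scores, a dense semantic map (H, W), and the set of stuff ids.
    Returns (panoptic_class, panoptic_instance) maps of shape (H, W).
    """
    h, w = semantic_map.shape
    pan_cls = np.zeros((h, w), dtype=np.int32)   # 0 = not yet assigned
    pan_inst = np.zeros((h, w), dtype=np.int32)

    # 1. Paint instances in descending confidence order; higher-scoring
    #    instances win conflicting pixels, enforcing the non-overlap rule.
    next_id = 1
    for idx in np.argsort(-np.asarray(instance_scores)):
        mask = instance_masks[idx] & (pan_cls == 0)
        # Skip fragments that are mostly occluded by higher-ranked instances.
        if mask.sum() < keep_frac * instance_masks[idx].sum():
            continue
        pan_cls[mask] = instance_classes[idx]
        pan_inst[mask] = next_id
        next_id += 1

    # 2. Fill the remaining pixels with stuff labels from the semantic branch.
    for sid in stuff_ids:
        region = (semantic_map == sid) & (pan_cls == 0)
        pan_cls[region] = sid                    # stuff regions get no instance id
    return pan_cls, pan_inst
```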
UPSNet improved this architecture by introducing a dedicated panoptic head to learn better fusion of semantic and instance information. Rather than simple heuristic rules, it uses learnable parameters to determine each pixel's final assignment. This design enables end-to-end optimization of panoptic segmentation performance, significantly improving segmentation quality.
Single-Stage Methods
Researchers developed various single-stage methods to simplify panoptic segmentation workflows and improve efficiency.
DETR (DEtection TRansformer) brought breakthrough progress by reframing object detection as a set prediction problem, avoiding complex post-processing.
Building on DETR's ideas, MaX-DeepLab proposed the first truly end-to-end panoptic segmentation method, using a dual-path architecture and a global memory mechanism to directly predict panoptic segmentation results.

Panoptic-DeepLab took a different approach, treating panoptic segmentation as dense prediction. The method predicts an instance-center heatmap and a per-pixel offset vector, then obtains the final instance segmentation through simple clustering.
This bottom-up approach avoids complex region proposal mechanisms and excels in real-time performance.
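A minimal sketch of that clustering step (assuming the center coordinates have already been extracted from the heatmap, e.g. by peak picking; names and shapes are hypothetical):

```python
import numpy as np

def group_instances(centers, offsets, thing_mask):
    """Assign each 'thing' pixel to its nearest predicted instance center.

    centers: (K, 2) array of predicted center coordinates (y, x).
    offsets: (2, H, W) per-pixel offset vectors pointing toward the pixel's center.
    thing_mask: boolean (H, W) map of pixels classified as Thing classes.
    Returns an (H, W) instance-id map (0 = Stuff / background).
    """
    inst_map = np.zeros(thing_mask.shape, dtype=np.int32)
    ys, xs = np.nonzero(thing_mask)
    if len(centers) == 0 or len(ys) == 0:
        return inst_map
    # Each pixel "votes" for a location: its own coordinate plus its offset.
    voted = np.stack([ys + offsets[0, ys, xs],
                      xs + offsets[1, ys, xs]], axis=1)                  # (P, 2)
    # Distance from every voted location to every center, then take the closest.
    dists = np.linalg.norm(voted[:, None, :] - np.asarray(centers)[None, :, :], axis=2)
    inst_map[ys, xs] = np.argmin(dists, axis=1) + 1                      # 1-based ids
    return inst_map
```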
Transformer-Based Methods
Transformer architecture in computer vision also transformed panoptic segmentation methods.
MaskFormer redefined panoptic segmentation as mask classification: a Transformer decoder predicts a set of masks with corresponding categories, and predictions are matched to ground truth during training via the Hungarian algorithm. This formulation unified semantic and instance segmentation processing and demonstrated excellent performance.
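For intuition, here is a minimal sketch of that bipartite matching step using SciPy's Hungarian solver (the cost terms and shapes are simplified assumptions, not MaskFormer's exact loss):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(pred_masks, pred_logits, gt_masks, gt_labels, w_cls=1.0, w_mask=1.0):
    """Match predicted segments to ground-truth segments one-to-one.

    pred_masks: (Q, H, W) soft masks in [0, 1]; pred_logits: (Q, C) class scores;
    gt_masks: (G, H, W) binary masks; gt_labels: (G,) class ids.
    Returns aligned index arrays (pred_idx, gt_idx).
    """
    gt_labels = np.asarray(gt_labels)
    z = pred_logits - pred_logits.max(axis=-1, keepdims=True)      # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    cost_cls = -probs[:, gt_labels]                                # (Q, G): -p(gt class)
    # Mask cost: one minus the soft Dice overlap of every prediction/GT pair.
    p = pred_masks.reshape(len(pred_masks), -1)
    g = gt_masks.reshape(len(gt_masks), -1).astype(float)
    dice = 2 * (p @ g.T) / (p.sum(1, keepdims=True) + g.sum(1)[None, :] + 1e-6)
    cost = w_cls * cost_cls + w_mask * (1.0 - dice)
    pred_idx, gt_idx = linear_sum_assignment(cost)                 # optimal assignment
    return pred_idx, gt_idx
```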

Mask2Former improved on MaskFormer with several enhancements, including masked attention and better query initialization strategies. These changes handle different object scales more effectively, with particularly notable gains in small-object segmentation.
OneFormer went further, proposing a single framework that handles semantic, instance, and panoptic segmentation simultaneously through a task-conditioned design, reaping the benefits of multi-task learning.
Measuring Panoptic Segmentation Accuracy
Evaluating panoptic segmentation accuracy centers on jointly quantifying semantic segmentation (category recognition) and instance segmentation (individual distinction). The evaluation considers not just pixel-level classification accuracy but also the completeness and consistency of the scene parse.
The authoritative metric is Panoptic Quality (PQ), proposed by FAIR in 2018. Its design philosophy unifies scene understanding's dual capabilities of "recognition" and "segmentation":
PQ = SQ × RQ
Segmentation Quality (SQ) measures how precise the matched segments are—the average IoU between matched predicted and ground-truth masks.
Recognition Quality (RQ) reflects detection accuracy and is computed like an F1 score, jointly capturing segment-level precision and recall (a predicted and a ground-truth segment match when their IoU exceeds 0.5).
PQ enforces non-overlap rules: each pixel belongs to only one prediction (semantic region or instance), avoiding evaluation distortion from conflicting semantic and instance segmentation results in traditional methods.
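A minimal sketch of the computation, assuming the matching step (IoU > 0.5) has already been done and we simply have the list of matched IoUs plus the unmatched counts:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute (PQ, SQ, RQ) from matched segments.

    matched_ious: IoU values of true-positive (predicted, ground-truth) pairs
    with IoU > 0.5; num_fp / num_fn: unmatched predicted / ground-truth segments.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                        # mean IoU over matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)       # F1-style recognition term
    return sq * rq, sq, rq

# Example: three matches, one false positive, one false negative.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)   # PQ = 0.8 * 0.75
```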
Additional metrics help diagnose model weaknesses: the semantic segmentation metrics mIoU and pixel accuracy (PA), and the instance segmentation metric AP. When PQ is low, a high mIoU combined with a low AP suggests the model parses backgrounds well but struggles to recognize individual instances.
Parsing Covering (PC) measures how well a model covers rare categories, making it particularly suitable for open-vocabulary panoptic segmentation (like CLIP-driven OOOPS models).
Panoptic Segmentation Annotation
What is Panoptic Segmentation Annotation?
Real projects typically require building a large-scale dataset before segmentation algorithms can be developed and trained. Producing that training data relies on a panoptic segmentation annotation workflow—data preparation in which human experts apply domain knowledge and judgment to create ground-truth labels.

Specifically, panoptic segmentation annotation assigns semantic category labels and instance IDs to every image pixel.
Unlike regular image segmentation, panoptic segmentation annotation must handle both "Stuff" and "Thing" categories, and additionally distinguish individual instances within "Thing" categories.
This annotation method provides complete image understanding with clear attribution for every pixel.
Panoptic Segmentation Annotation Methods
Panoptic segmentation annotation typically employs hierarchical annotation strategies.
First, perform semantic segmentation, assigning category labels to each region. Using BasicAI Data Annotation Platform as an example:
For image panoptic segmentation, use polygon tools or brushes to outline object contours, then use Fill tools to fill regions. For regular-shaped objects, geometric tools (rectangles, circles) improve efficiency.
For point cloud panoptic segmentation, use lasso brushes to select points with identical semantics. To avoid selecting overlapping points in 3D space, continuously adjust the viewpoint or filter the displayed range. See our blog post on 3D point cloud segmentation for specific methods.

After completing semantic annotation, distinguish instances for "Thing" categories, assigning unique instance IDs to different objects of the same category.
Modern annotation tools like BasicAI Data Annotation Platform typically integrate AI assistance features. Pre-trained models like SAM (Segment Anything Model) or point cloud segmentation models provide initial segmentation suggestions; annotators only need to correct and confirm.
A few rules apply throughout annotation: every image pixel must end up with a semantic category and an instance ID (left unlabeled only if genuinely uncertain); every semantic category belongs to either Stuff or Things, never both; and Stuff categories carry no instance IDs—each Stuff category is treated as a single region.
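Under the hood, one common storage format for finished annotations is the COCO panoptic format: a PNG in which each segment's integer id is packed into the RGB channels (id = R + 256·G + 256²·B), plus a JSON file mapping segment ids to categories and instances. A minimal decoding sketch, assuming that standard encoding:

```python
import numpy as np
from PIL import Image

def decode_coco_panoptic(png_path):
    """Decode a COCO-panoptic-style label PNG into a per-pixel segment-id map.

    Each segment's integer id is packed into the RGB channels as
    id = R + 256*G + 256**2*B; a companion JSON file (not shown) maps each
    segment id to its category and, for Thing classes, its instance.
    """
    rgb = np.array(Image.open(png_path), dtype=np.uint32)
    seg_ids = rgb[..., 0] + 256 * rgb[..., 1] + 256 ** 2 * rgb[..., 2]
    return seg_ids  # shape (H, W); 0 conventionally means "unlabeled"
```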
Applications of Panoptic Segmentation
Panoptic segmentation provides both pixel-level semantic understanding and instance distinction, upgrading traditional vision tasks and enabling many new applications.
Autonomous Driving
Through panoptic segmentation, autonomous driving systems accurately identify drivable areas, distinguish different lanes, and recognize road boundaries and obstacles. For dynamic objects like pedestrians and vehicles, systems not only recognize categories but distinguish individuals—crucial for trajectory prediction and collision avoidance.

In 2019, Uber ATG integrated panoptic segmentation modules into its autonomous driving system for simultaneous road-structure understanding and traffic-participant identification. Though preliminary, this attempt drew industry attention to panoptic segmentation.
Tesla's FSD (Full Self-Driving) system employs "Occupancy Networks," upgrading traditional 3D detection cuboids to voxel-level scene understanding. This change enables vehicles to recognize irregular objects like fallen traffic cones and scattered cargo, substantially improving system safety.
Robotics and Automation
Robotic navigation and interaction closely resemble autonomous driving—panoptic segmentation provides critical information for robots to understand and manipulate environments.
Service robots navigate complex indoor environments, identifying furniture, doors, and windows (static objects) while avoiding moving people and pets. Panoptic segmentation enables robots to build semantic maps, knowing not just where obstacles exist but understanding their nature.
Industrial robots performing grasping and assembly tasks need precise identification and localization of different parts. Panoptic segmentation helps robots distinguish overlapping or closely arranged objects, even within the same category.
Agricultural robotics represents another important application. In precision agriculture, robots must identify crops, weeds, and soil, taking appropriate measures for different regions.
Panoptic segmentation precisely delineates different field areas, helping robots perform selective spraying, precision fertilization, or selective harvesting—greatly improving agricultural efficiency while reducing resource waste.
Medical Imaging Analysis
Medical fields demand extremely high image segmentation precision, where panoptic segmentation demonstrates unique value. In pathological image analysis, doctors must identify different cell types and tissue structures while counting specific cell types.

In 2019, Google Health applied panoptic segmentation to breast cancer pathology slide analysis, identifying not just cancer cells but distinguishing different subtypes and surrounding healthy tissue.
In organ and tumor segmentation, panoptic segmentation helps doctors better understand relationships between lesion regions and surrounding normal tissue. For liver tumor diagnosis, systems must simultaneously segment liver, blood vessels, and multiple tumor lesions. Panoptic segmentation clearly shows these structures' spatial relationships, assisting surgical planning.
Augmented and Virtual Reality
Panoptic segmentation opens new interactive possibilities in AR/VR. In AR applications, accurate scene understanding enables natural fusion of virtual objects with real environments. Panoptic segmentation identifies the different surfaces and objects in a scene, allowing virtual content to be correctly placed on tables, hung on walls, or displayed around real objects.
In 2021, Meta first applied real-time panoptic segmentation in Oculus Quest 2's Passthrough+ feature. Users could see specific real-world objects like keyboards and cups within VR environments, achieving mixed reality interaction.
Apple's 2023 Vision Pro pushed this further. The device segments users' hands, surrounding furniture, and people in real time, creating a "spatial computing" experience. Users can "pin" virtual screens to walls, and virtual objects are correctly occluded by real furniture, delivering unprecedented immersion.
In AR/VR and entertainment domains like social media, panoptic segmentation is becoming key technology for reality-virtuality fusion.
Smart Retail
Amazon's Just Walk Out technology represents panoptic segmentation's most successful retail application.
Since the first Amazon Go store opened in 2018, the technology has been deployed in over 40 stores globally. The system uses ceiling-mounted cameras for panoptic segmentation, tracking each customer and the items they pick up with 98.7% accuracy.
As computing power grows and algorithms improve, panoptic segmentation is expanding into more domains.
In 2023, OpenAI demonstrated GPT-4V's panoptic segmentation capabilities in understanding complex scenes, signaling the large model era's arrival. NVIDIA's latest research shows panoptic segmentation combined with 3D reconstruction can create digital twin cities, offering new possibilities for smart city construction.
From technology development trajectories, panoptic segmentation has grown from academic concept to core technology supporting multiple industries.
Key Panoptic Segmentation Datasets
COCO Panoptic
COCO Panoptic stands as one of panoptic segmentation's most influential datasets. As an extension of the COCO dataset, it launched in 2018 alongside the panoptic segmentation task proposal.
The dataset contains over 118K training images across 133 categories (80 things, 53 stuff). An average of 7.7 instances per image provides diverse training scenarios.
COCO Panoptic's diversity makes it the standard benchmark for evaluating algorithm generalization. Its annotation quality sets standards for other datasets.
Cityscapes
The Cityscapes dataset focuses on urban street-scene understanding and is an important resource for autonomous driving research. It includes 5K finely annotated images and 20K coarsely annotated images, collected in 50 different European cities.

Cityscapes is unique due to its annotation precision. Every pixel has accurate category labels covering 30 categories including roads, sidewalks, buildings, traffic signs, pedestrians, and vehicles.
The dataset also provides stereo image pairs and GPS information, supporting 3D scene understanding research. Its strict annotation protocol ensures high-quality ground truth, with average annotation time exceeding 90 minutes per image.
ADE20K
The ADE20K dataset offers diverse scenes and rich annotations. Released by MIT, it contains over 20K training images and 2K validation images covering 150 semantic categories.
Compared to other datasets, ADE20K features extremely rich scene types, from indoor kitchens and bedrooms to outdoor mountains and beaches, covering virtually all human living environments.
Dataset annotations include not only pixel-level segmentation but also object part information and scene semantic attributes, supporting deep scene understanding research.
KITTI-360
The KITTI-360 dataset extends KITTI for panoptic segmentation and is designed for autonomous driving research. It provides over 300K laser scans with corresponding panoptic annotations, covering 73.7 kilometers of driving.

KITTI-360 features tightly aligned multi-modal data, including stereo camera images, 3D LiDAR point clouds, and GPS/IMU measurements.
This multi-modal nature lets researchers explore 2D-3D fusion methods for panoptic segmentation. The dataset also provides temporally consistent annotations that support video panoptic segmentation research.
Future Directions
The field is evolving rapidly across several fronts:
Large Models: SAM (Segment Anything Model) and similar large-scale models provide strong priors for panoptic segmentation. Fine-tuning these models dramatically reduces data requirements for new domains.
Open Vocabulary and Zero-Shot Learning: CLIP-based methods enable segmentation of arbitrary concepts without retraining. Through meta-learning and self-supervised methods, models adapt to new scenes with minimal or no annotated samples, dramatically lowering application barriers.
3D Point Cloud Panoptic Segmentation: LiDAR point cloud panoptic segmentation processes data from lidar and other 3D sensors directly, using sparse convolutions and graph neural networks to deliver precise 3D scene understanding for autonomous driving and robotics. The latest methods jointly segment 2D images and 3D point clouds, using multimodal fusion to improve precision and robustness in complex scenes.
Synergy with Other Vision Tasks: Panoptic segmentation deeply integrates with depth estimation, 3D reconstruction, and object tracking, forming unified scene understanding frameworks. Multi-task learning brings complementary performance and improves computational efficiency.
Summary
Since FAIR first proposed panoptic segmentation in 2018, this technology has moved from academic research to industrial application, profoundly changing computer vision's trajectory.
By unifying Things and Stuff processing and merging semantic with instance segmentation, panoptic segmentation achieves machines' complete understanding of the visual world.
A short recap of what we have covered in this post:
Panoptic segmentation's essence is unified processing of Things (countable objects) and Stuff (uncountable materials).
From two-stage fusion to end-to-end architectures to Transformer applications, technical evolution shows clear optimization paths.
The PQ (Panoptic Quality) metric cleverly unifies segmentation precision and recognition accuracy evaluation through SQ×RQ design.
High-quality datasets like COCO Panoptic and Cityscapes drive algorithm development, with precise annotation forming these datasets' foundation.
In actual project development, building high-quality training datasets often represents the most time-consuming yet critical phase. BasicAI, as a leading data annotation service provider, offers complete expert and toolset support for panoptic segmentation tasks, helping teams rapidly build required training data.
Particularly when handling complex instance distinction and 3D point cloud segmentation, the platform's AI assistance features significantly improve efficiency, letting researchers and engineers focus more on algorithmic innovation and application deployment.
Click below to discuss your panoptic segmentation project solutions with us.





