## 1. A Glimpse into the World of 3D Point Clouds

🤔 *Imagine you're struck by a strange affliction that makes everything you see a blurry, indistinguishable blob of the same color 👨‍🦯. You attempt to embrace your loved one 💑, only to collide headfirst with a pillar 🤕. The next time, you cleverly avoid the pillar 😏, but it turns out that was your significant other all along 🤦.*

This is the issue with 3D point clouds: they provide accurate location information but lack semantic data, making it difficult for AI to distinguish between different objects 😵‍💫. Just look at the point cloud image above; it resembles a virtual digital space from an '80s cyberpunk movie 👨‍💻.

Naturally, we imagine that it would be great to combine the location data in 3D space with the visual information from 2D images. And so, the concept of 2D/3D fusion annotation was born 📽️.

## 2. The Principles of Fusion Annotation

Why is it called "fusion" annotation? Can't we just annotate both the 3D point cloud and 2D image separately? Well, that would be inefficient! 🙅

Since cameras can map points from 3D space onto 2D images, we can simulate the "photo-taking" process after annotating a 3D point cloud. This way, the 3D annotation is "captured" in the 2D image without the need for further annotation 📸.

The principle behind camera imaging is the pinhole camera model 🕯️, which you might recall from middle school physics.

Now consider this: everyone perceives the world with themselves as the origin. Since 3D point clouds are generated by LiDAR sensors, their coordinate systems originate in the "eye" of the LiDAR (i.e., LiDAR's origin). Meanwhile, cameras use their own coordinate systems (with the camera's origin), which change based on the camera's position and angle. 🌏

We'll refer to these as the point cloud coordinate system and the camera coordinate system. 👐

In short, mapping a point from a 3D point cloud to a 2D image involves two steps: first, transforming the point's coordinates from the point cloud coordinate system (3D) to the camera coordinate system (3D); then, simulating the photo-taking process to convert the point to the image coordinate system (2D). 🧐

Linear algebra teaches us that matrix operations correspond to spatial transformations 📐. Essentially, changing coordinate systems is just a matter of matrix manipulation 😌. Two crucial matrices are the intrinsic matrix (from camera coordinate system to image coordinate system) and the extrinsic matrix (from point cloud coordinate system to camera coordinate system) 👨‍🏫.
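To make the two steps concrete, here is a minimal NumPy sketch. Everything numeric in it is made up for illustration: the rotation `R`, the translation `t`, and the intrinsic matrix `K` are hypothetical values (real ones come from sensor calibration), and `lidar_to_pixel` is just a name chosen here.

```python
import numpy as np

# Hypothetical extrinsic parameters (point cloud -> camera).
# Assumes a LiDAR with x forward, y left, z up and a camera with
# x right, y down, z forward; real values come from calibration.
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
t = np.array([0.0, -0.2, 0.1])  # meters, made up

# Hypothetical intrinsic matrix (camera -> image), in pixels.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

def lidar_to_pixel(p_lidar):
    """Map one 3D point in point cloud coordinates to pixel coordinates (u, v)."""
    p_cam = R @ p_lidar + t   # step 1: point cloud coords -> camera coords (3D -> 3D)
    uvw = K @ p_cam           # step 2: camera coords -> homogeneous image coords
    return uvw[:2] / uvw[2]   # divide by depth to land on the 2D image

u, v = lidar_to_pixel(np.array([10.0, 1.0, 0.5]))
print(u, v)
```

The rest of this article explains where the numbers inside `K` come from.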

## 3. Intrinsic Matrix

For now, let's ignore the extrinsic parameters and consider the coordinate system where the large blue arrow is located as the camera coordinate system. The coordinates of the tip of the arrow are represented as ($X_c$, $Y_c$, $Z_c$) ✍️. The transformation from the camera coordinate system to the image coordinate system is not accomplished in one step. Below, we'll break it down into multiple steps and explain it in detail 👨‍🏫.

### 3.1 First Spatial Transformation: Linear Transformation

The first spatial transformation is the scaling transformation from the camera coordinate system to the blue coordinate system in the image, which is a linear transformation. Since many materials do not specifically emphasize the blue coordinate system, we'll name it the "intermediate coordinate system" here. According to similar triangles (the triangle formed by the person's head, feet, and camera focal point $F_c$), the coordinates of the small blue arrow can be obtained ✍️:
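Spelled out from the similar-triangles argument, the relation would read as follows (the names $x$ and $y$ for the intermediate coordinates are assumed here, since the text does not name them):

$$
x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}
$$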

Here, $f$ is the focal length, which is the distance from the origin of the camera coordinate system (the $F_c$ in the image) to the image plane, in millimeters ✍️.

At this point, the intermediate coordinate system is still not the image coordinate system, for two reasons. First, image lengths are measured in pixels rather than millimeters, so a unit conversion is needed. Second, the origin of the intermediate coordinate system lies on the camera's optical axis, in line with the camera focal point, while the origin of the image coordinate system is in the upper left corner of the image, so a translation is also needed ✍️.

### 3.2 Unit Conversion

To convert units, we divide the current coordinates by a conversion factor ✍️. In the $x$ and $y$ directions, the factors are $d_x$ and $d_y$, which give the physical size of one pixel on the sensor in millimeters/pixel ✍️. After the division, the coordinates are in pixels, the same unit as the image coordinate system, so we'll rename them $u'$ and $v'$ ✍️.
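In symbols, and assuming the intermediate coordinates are called $x$ and $y$ (in millimeters), the conversion would be:

$$
u' = \frac{x}{d_x} = \frac{f}{d_x}\,\frac{X_c}{Z_c}, \qquad v' = \frac{y}{d_y} = \frac{f}{d_y}\,\frac{Y_c}{Z_c}
$$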

Generally, the focal length and conversion factors are combined and simplified as:
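The usual shorthand, with $f_x$ and $f_y$ denoting the focal length measured in pixels along each axis, would be:

$$
f_x = \frac{f}{d_x}, \quad f_y = \frac{f}{d_y} \quad\Longrightarrow\quad u' = f_x\,\frac{X_c}{Z_c}, \quad v' = f_y\,\frac{Y_c}{Z_c}
$$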

### 3.3 Second Spatial Transformation: Translation Transformation

The second spatial transformation is a translation ✍️, which is not a linear transformation because it moves the origin. Adding the constants $c_x$ and $c_y$ (the pixel coordinates of the point where the camera's optical axis meets the image) to the current coordinates gives the final $u$ and $v$ ✍️:
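Taking the constants to be $c_x$ and $c_y$ (the names used at the end of this section), the translation would read:

$$
u = u' + c_x = f_x\,\frac{X_c}{Z_c} + c_x, \qquad v = v' + c_y = f_y\,\frac{Y_c}{Z_c} + c_y
$$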

### 3.4 Overall Transformation: Affine Transformation

The above two spatial transformations are equivalent to a single affine transformation, and low-dimensional affine transformations can be achieved through linear transformations in a higher dimension ✍️. First, let's represent the complete affine transformation ✍️:
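Written as a scaling matrix plus an offset vector, and using the same $f_x$, $f_y$, $c_x$, $c_y$ as before, one way to express it is:

$$
\begin{bmatrix} u \\ v \end{bmatrix}
=
\begin{bmatrix} f_x & 0 \\ 0 & f_y \end{bmatrix}
\begin{bmatrix} X_c / Z_c \\ Y_c / Z_c \end{bmatrix}
+
\begin{bmatrix} c_x \\ c_y \end{bmatrix}
$$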

Increase the dimension! Convert to a linear transformation ✍️:
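Appending a constant 1 to both vectors (homogeneous coordinates) lets the offset move into the matrix as an extra column, so the same relation can be sketched as a single matrix product:

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_c / Z_c \\ Y_c / Z_c \\ 1 \end{bmatrix}
$$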

Multiply both sides of the equation by $Z_c$ ✍️:
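With the depth written as $Z_c$, this step would give:

$$
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
$$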

If we carefully observe the matrix multiplication equation above, we will find that it completely conforms to our intuitive understanding of the photographic process 😲. The red diagonal matrix corresponds to the scaling of space, that is, the object is scaled to fit into the photo; the blue column corresponds to the translation of the coordinate system origin from the center of the camera coordinate system to the upper left corner of the image; finally, by taking only the first two coordinates (*u*, *v*) from the three coordinates (*u*, *v*, 1), it corresponds to the dimension reduction from 3D to 2D 👏.

In the end, we find that by multiplying the coordinates in the camera coordinate system by a 3×3 matrix, and then dividing by $Z_c$, we can obtain the coordinates $u$ and $v$ of the point in the image coordinate system 🎉.
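As a sanity check, the step-by-step route and the matrix route should land on the same pixel. The sketch below compares them with NumPy; all camera parameters and the sample point are made-up numbers, and the variable names are chosen here for illustration.

```python
import numpy as np

# Made-up camera parameters, for illustration only.
f = 4.0                  # focal length in mm
dx, dy = 0.004, 0.004    # pixel size on the sensor in mm/pixel
cx, cy = 640.0, 360.0    # offset of the image origin, in pixels

Xc, Yc, Zc = 0.5, -0.25, 5.0   # a point in camera coordinates

# Step-by-step route (sections 3.1 to 3.3).
x, y = f * Xc / Zc, f * Yc / Zc                  # 3.1: similar triangles (mm)
u_prime, v_prime = x / dx, y / dy                # 3.2: mm -> pixels
u_steps, v_steps = u_prime + cx, v_prime + cy    # 3.3: shift the origin

# Matrix route (section 3.4).
fx, fy = f / dx, f / dy
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])
uvw = K @ np.array([Xc, Yc, Zc])
u_mat, v_mat = uvw[:2] / uvw[2]   # divide by the third component, which equals Zc

print(np.allclose([u_steps, v_steps], [u_mat, v_mat]))
```

Both routes agree because the third row of the matrix simply copies the depth through, so the final division is the same division by depth used in the similar-triangles step.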

*This matrix is called the intrinsic matrix 🙆.*

The name of this matrix reflects both the spatial transformation of the coordinates from the outside of the camera to the inside and the fact that these parameters are determined by the internal factors of the camera 📷.

Although the derivation process above is quite complex, there is no need to worry about intermediate variables when actually using it; just input the final $f_x$, $f_y$, $c_x$, and $c_y$ 👌.
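In practice that might look like the sketch below, which projects a whole batch of camera-frame points at once. The function name `project_points` and the sample values are hypothetical. Dropping points with non-positive depth is a practical assumption added here (such points lie behind the camera and cannot appear in the photo); it is not something the derivation above covers.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Project (N, 3) camera-frame points to (M, 2) pixel coordinates.

    Points with depth <= 0 are behind the camera and are dropped, so M <= N.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    in_front = points_cam[:, 2] > 0
    pts = points_cam[in_front]
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    uvw = pts @ K.T                    # row-wise version of K @ p
    return uvw[:, :2] / uvw[:, 2:3]    # divide each row by its own depth

pixels = project_points(
    [[0.5, -0.25, 5.0], [0.0, 0.0, 2.0], [1.0, 1.0, -3.0]],
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
print(pixels)
```

A point on the optical axis, such as the second one, lands exactly at ($c_x$, $c_y$), which is a handy way to check that the parameters were entered in the right order.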