wayveai / fiery

PyTorch code for the paper "FIERY: Future Instance Segmentation in Bird's-Eye view from Surround Monocular Cameras"

Home Page: https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view

Question about the projection_to_birds_eye_view function

taylover-pei opened this issue

Congratulations on your great work!

I want to follow your work for future research and I have some questions about your released code below:

In the fiery.py file of your code, could you provide more details about the get_geometry function and the projection_to_birds_eye_view function? I'm confused about how they actually work, especially the code shown in the red box below.
[screenshot of the code in question]

Thank you very much. Looking forward to your reply!

Hey!

Thanks for the kind words :)

Let's review the function calculate_birds_eye_view_features.

  1. self.encoder_forward(x). The image inputs of dimension (N, 3, H, W) (N=6 cameras, H=height, W=width) are fed to an encoder that outputs features of size (N, C, H/8, W/8), because the encoder downsamples the images by a factor of 8.
  2. geometry = self.get_geometry(intrinsics, extrinsics). We calculate the 3D positions of the downsampled features using the camera intrinsics and extrinsics. If this step is unclear, this blog post explains it well: https://ksimek.github.io/2013/08/13/intrinsic/ (a small worked example is also given after this list).
  3. x = self.projection_to_birds_eye_view(x, geometry). We sum all the 3D features along the vertical dimension to form bird's-eye view features. To do so, we discretise the space around the ego-vehicle into 0.5m x 0.5m infinite columns, and the 3D features are summed across the columns. In this particular implementation, which is adapted from https://github.com/nv-tlabs/lift-splat-shoot/blob/master/src/models.py#L200, we use a "ranking trick" to sort the 3D features so that the features that belong to the same infinite column are right next to each other. This speeds up the summing process when backpropagating (see the implementation of VoxelsSumming), compared to a naive cumulative sum, which works as well but is slower. A minimal sketch of this trick also follows below.
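To make step 2 concrete, here is a tiny unprojection example with made-up numbers (the intrinsics, extrinsics and the camera-to-ego convention are assumptions for illustration, not the exact get_geometry code): a pixel (u, v) at a candidate depth d is lifted into the camera frame with the inverse intrinsics, then moved into the common ego frame with the extrinsics.

```python
import torch

# Hypothetical intrinsics and camera-to-ego extrinsics, for illustration only.
intrinsics = torch.tensor([[1000.,    0., 800.],
                           [   0., 1000., 450.],
                           [   0.,    0.,   1.]])
extrinsics = torch.eye(4)
extrinsics[0, 3] = 1.5            # e.g. camera mounted 1.5m forward of the ego origin

u, v, d = 800., 450., 10.         # pixel coordinates and one candidate depth

# Lift the pixel into the camera frame: X_cam = d * K^-1 [u, v, 1]^T
point_cam = d * torch.linalg.inv(intrinsics) @ torch.tensor([u, v, 1.])

# Move it into the shared ego frame with the camera-to-ego transform.
point_ego = extrinsics @ torch.cat([point_cam, torch.tensor([1.])])
print(point_ego[:3])              # the kind of 3D position stored in the geometry tensor
```

The actual code does this for every camera, feature location and depth candidate at once, which is what produces the geometry tensor with one 3D point per feature.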
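And here is a minimal sketch of the "ranking trick" mentioned in step 3 (hypothetical names and shapes; the real VoxelsSumming additionally defines the backward pass explicitly): features are sorted by the id of the BEV cell they fall into, a cumulative sum is taken, only the last entry of each run of equal ids is kept, and subtracting consecutive kept entries yields the per-cell sums.

```python
import torch

def pool_to_bev(x, ranks, num_cells):
    """Sum together all features that fall into the same bird's-eye view cell.

    x:     (M, C) flattened 3D features from all cameras
    ranks: (M,)   integer id of the BEV cell (infinite column) each feature falls into
    """
    # Sort so that features belonging to the same cell are contiguous.
    order = ranks.argsort()
    x, ranks = x[order], ranks[order]

    # Cumulative sum over all sorted features.
    x = x.cumsum(dim=0)

    # Keep only the last entry of each run of equal ranks...
    kept = torch.ones(x.shape[0], dtype=torch.bool)
    kept[:-1] = ranks[1:] != ranks[:-1]
    x, ranks = x[kept], ranks[kept]

    # ...and subtract the previous kept entry to recover the per-cell sums.
    x = torch.cat((x[:1], x[1:] - x[:-1]))

    # Scatter the per-cell sums into the flat BEV grid.
    bev = torch.zeros(num_cells, x.shape[1])
    bev[ranks] = x
    return bev

# Toy usage: 1000 features with 64 channels pooled onto a 200x200 grid.
features = torch.randn(1000, 64)
ranks = torch.randint(0, 200 * 200, (1000,))
bev = pool_to_bev(features, ranks, 200 * 200).view(200, 200, 64)
```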

Hope that was helpful, and good luck with your project!

Hi! Thanks for your reply. It really helps me a lot!

I have understood the execution logic of self.encoder_forward(x) and geometry = self.get_geometry(intrinsics, extrinsics). However, for x = self.projection_to_birds_eye_view(x, geometry), I still have a few questions.

  1. What do the values in the geometry tensor passed to the self.projection_to_birds_eye_view(x, geometry) function stand for? Do they represent the coordinate correspondence used during the transformation from the frustum features to the voxel features? Could you give some examples to make the process clearer?

  2. Since the feature maps are transformed from the front view (frustum features) to the bird's-eye view (voxel features), where is this transformation reflected in the self.projection_to_birds_eye_view(x, geometry) function?

  3. How are features of multiple cameras fused in the self.projection_to_birds_eye_view(x, geometry) function?

Thank you very much. These problems have bothered me for a long time. Looking forward to your reply!

  1. The geometry tensor is aligned with the x tensor, and corresponds to the 3D position of each feature. In other words, by using the geometry tensor you can look up the 3D location of each element in x, which tells you which voxel each feature belongs to when summing in bird's-eye view.
  2. and 3. The feature maps come from the N=6 cameras, and using the geometry tensor we can position them in 3D in a common reference frame, which is the inertial center of the vehicle. A small sketch of how this answers both questions is given below.
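To illustrate points 2 and 3 with a rough, self-contained sketch (the shapes, grid bounds and the use of index_add_ are assumptions for illustration, not the exact projection_to_birds_eye_view code): the geometry tensor turns every frustum feature, from every camera, into a BEV cell index, and a single scatter-add into a shared grid performs both the front-view to bird's-eye-view transformation and the fusion of the cameras.

```python
import torch

# Hypothetical shapes: N=6 cameras, D depth bins, Hf x Wf feature maps, C channels.
geometry = torch.rand(6, 48, 28, 60, 3) * 100 - 50   # 3D points (x, y, z) in the ego frame
x = torch.randn(6, 48, 28, 60, 64)                    # the matching frustum features

bev_size, resolution, bev_start = 200, 0.5, -50.0     # 100m x 100m grid, 0.5m cells

# Convert 3D positions into integer BEV cell indices. The vertical (z) axis is ignored,
# so every point in the same infinite column maps to the same cell: this is the
# front-view -> bird's-eye-view transformation.
cell = ((geometry[..., :2] - bev_start) / resolution).long()

# Drop features that land outside the grid.
valid = (cell >= 0).all(-1) & (cell < bev_size).all(-1)
feat = cell.new_tensor(0.)  # placeholder line removed below; kept simple:
feat = x[valid]          # (M, C) features from *all* cameras, mixed together
cell = cell[valid]       # (M, 2) their BEV cell coordinates

# Sum every feature that falls in the same cell. Because all cameras were expressed in
# the same ego reference frame by get_geometry, this single scatter-add also fuses them.
flat = cell[:, 0] * bev_size + cell[:, 1]
bev = torch.zeros(bev_size * bev_size, feat.shape[-1])
bev.index_add_(0, flat, feat)
bev = bev.view(bev_size, bev_size, -1)
```

In FIERY itself this summation is done with the ranking trick shown earlier rather than index_add_, but the idea is the same: the geometry tensor provides the mapping from each feature to a BEV cell, and the per-cell sum is where the multi-camera fusion happens.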

I have understood it, thank you very much!

No problem, glad I could help! :)