wayveai / fiery

PyTorch code for the paper "FIERY: Future Instance Segmentation in Bird's-Eye view from Surround Monocular Cameras"

Home Page: https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view

Question about the projection_to_birds_eye_view function

taylover-pei opened this issue

Congratulations on your great work!

I want to follow your work for future research and I have some questions about your released code below:

In the fiery.py file of your code, could you provide more details about the get_geometry function and the projection_to_birds_eye_view function? I'm confused about how they actually work, especially the code shown in the red box below.
[screenshot of the code in question]

Thank you very much. Looking forward to your reply!

Hey!

Thanks for the kind words :)

Let's review the function calculate_birds_eye_view_features.

  1. self.encoder_forward(x). The image inputs of dimension (N, 3, H, W) (N=6 cameras, H=height, W=width) are fed to an encoder that outputs features of size (N, C, H/8, W/8), because the encoder downsamples the images by a factor of 8.
  2. geometry = self.get_geometry(intrinsics, extrinsics). We calculate the 3D positions of the downsampled features using the camera intrinsics and extrinsics. If this step is unclear, this blog post explains it well: https://ksimek.github.io/2013/08/13/intrinsic/ (a small worked example is also given after this list).
  3. x = self.projection_to_birds_eye_view(x, geometry). We sum all the 3D features along the vertical dimension to form bird's-eye view features. To do so, we discretise the space around the ego-vehicle into 0.5m x 0.5m infinite columns, and the 3D features are summed across the columns. In this particular implementation, which is adapted from https://github.com/nv-tlabs/lift-splat-shoot/blob/master/src/models.py#L200, we use a "ranking trick" to sort the 3D features so that the features that belong to the same infinite column are right next to each other. This speeds up the summing process when backpropagating (see the implementation of VoxelsSumming), compared to a naive cumulative sum, which works as well but is slower. A minimal sketch of this trick also follows below.
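To make step 2 concrete, here is a tiny unprojection example with made-up numbers (the intrinsics, extrinsics and the camera-to-ego convention are assumptions for illustration, not the exact get_geometry code): a pixel (u, v) at a candidate depth d is lifted into the camera frame with the inverse intrinsics, then moved into the common ego frame with the extrinsics.

```python
import torch

# Hypothetical intrinsics and camera-to-ego extrinsics, for illustration only.
intrinsics = torch.tensor([[1000.,    0., 800.],
                           [   0., 1000., 450.],
                           [   0.,    0.,   1.]])
extrinsics = torch.eye(4)
extrinsics[0, 3] = 1.5            # e.g. camera mounted 1.5m forward of the ego origin

u, v, d = 800., 450., 10.         # pixel coordinates and one candidate depth

# Lift the pixel into the camera frame: X_cam = d * K^-1 [u, v, 1]^T
point_cam = d * torch.linalg.inv(intrinsics) @ torch.tensor([u, v, 1.])

# Move it into the shared ego frame with the camera-to-ego transform.
point_ego = extrinsics @ torch.cat([point_cam, torch.tensor([1.])])
print(point_ego[:3])              # the kind of 3D position stored in the geometry tensor
```

The actual code does this for every camera, feature location and depth candidate at once, which is what produces the geometry tensor with one 3D point per feature.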
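And here is a minimal sketch of the "ranking trick" mentioned in step 3 (hypothetical names and shapes; the real VoxelsSumming additionally defines the backward pass explicitly): features are sorted by the id of the BEV cell they fall into, a cumulative sum is taken, only the last entry of each run of equal ids is kept, and subtracting consecutive kept entries yields the per-cell sums.

```python
import torch

def pool_to_bev(x, ranks, num_cells):
    """Sum together all features that fall into the same bird's-eye view cell.

    x:     (M, C) flattened 3D features from all cameras
    ranks: (M,)   integer id of the BEV cell (infinite column) each feature falls into
    """
    # Sort so that features belonging to the same cell are contiguous.
    order = ranks.argsort()
    x, ranks = x[order], ranks[order]

    # Cumulative sum over all sorted features.
    x = x.cumsum(dim=0)

    # Keep only the last entry of each run of equal ranks...
    kept = torch.ones(x.shape[0], dtype=torch.bool)
    kept[:-1] = ranks[1:] != ranks[:-1]
    x, ranks = x[kept], ranks[kept]

    # ...and subtract the previous kept entry to recover the per-cell sums.
    x = torch.cat((x[:1], x[1:] - x[:-1]))

    # Scatter the per-cell sums into the flat BEV grid.
    bev = torch.zeros(num_cells, x.shape[1])
    bev[ranks] = x
    return bev

# Toy usage: 1000 features with 64 channels pooled onto a 200x200 grid.
features = torch.randn(1000, 64)
ranks = torch.randint(0, 200 * 200, (1000,))
bev = pool_to_bev(features, ranks, 200 * 200).view(200, 200, 64)
```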

Hope that was helpful, and good luck with your project!

Hi! Thanks for your reply. It really helps me a lot!

I have understood the execution logic of self.encoder_forward(x) and geometry = self.get_geometry(intrinsics, extrinsics). However, for x = self.projection_to_birds_eye_view(x, geometry), I still have a few questions.

  1. What do the values in the geometry tensor passed to the self.projection_to_birds_eye_view(x, geometry) function stand for? Do they represent the coordinate correspondence used during the transformation from the frustum features to the voxel features? Could you give some examples to make the process clearer?

  2. Since the feature maps are transformed from the front view (frustum features) to the bird's-eye view (voxel features), where is this transformation reflected in the self.projection_to_birds_eye_view(x, geometry) function?

  3. How are features of multiple cameras fused in the self.projection_to_birds_eye_view(x, geometry) function?

Thank you very much. These problems have bothered me for a long time. Looking forward to your reply!

  1. The geometry tensor is aligned with the x tensor, and corresponds to the 3D position of each feature. In other words, by using the geometry tensor you can look up the 3D location of each element in x, which tells you which voxel each feature belongs to when summing in bird's-eye view.
  2. and 3. The feature maps come from the N=6 cameras, and using the geometry tensor we can position them in 3D in a common reference frame, which is the inertial center of the vehicle. A small sketch of how this answers both questions is given below.
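To illustrate points 2 and 3 with a rough, self-contained sketch (the shapes, grid bounds and the use of index_add_ are assumptions for illustration, not the exact projection_to_birds_eye_view code): the geometry tensor turns every frustum feature, from every camera, into a BEV cell index, and a single scatter-add into a shared grid performs both the front-view to bird's-eye-view transformation and the fusion of the cameras.

```python
import torch

# Hypothetical shapes: N=6 cameras, D depth bins, Hf x Wf feature maps, C channels.
geometry = torch.rand(6, 48, 28, 60, 3) * 100 - 50   # 3D points (x, y, z) in the ego frame
x = torch.randn(6, 48, 28, 60, 64)                    # the matching frustum features

bev_size, resolution, bev_start = 200, 0.5, -50.0     # 100m x 100m grid, 0.5m cells

# Convert 3D positions into integer BEV cell indices. The vertical (z) axis is ignored,
# so every point in the same infinite column maps to the same cell: this is the
# front-view -> bird's-eye-view transformation.
cell = ((geometry[..., :2] - bev_start) / resolution).long()

# Drop features that land outside the grid.
valid = (cell >= 0).all(-1) & (cell < bev_size).all(-1)
feat = cell.new_tensor(0.)  # placeholder line removed below; kept simple:
feat = x[valid]          # (M, C) features from *all* cameras, mixed together
cell = cell[valid]       # (M, 2) their BEV cell coordinates

# Sum every feature that falls in the same cell. Because all cameras were expressed in
# the same ego reference frame by get_geometry, this single scatter-add also fuses them.
flat = cell[:, 0] * bev_size + cell[:, 1]
bev = torch.zeros(bev_size * bev_size, feat.shape[-1])
bev.index_add_(0, flat, feat)
bev = bev.view(bev_size, bev_size, -1)
```

In FIERY itself this summation is done with the ranking trick shown earlier rather than index_add_, but the idea is the same: the geometry tensor provides the mapping from each feature to a BEV cell, and the per-cell sum is where the multi-camera fusion happens.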

I have understood it, thank you very much!

No problem, glad I could help! :)