NVlabs / RVT

Official Code for RVT-2 and RVT

Home Page: https://robotic-view-transformer-2.github.io/

Does line 150 give the Cartesian coordinates of a pixel?

StarCycle opened this issue · comments

RVT/rvt/mvt/mvt_single.py

Lines 146 to 157 in 0b170d7

self.pixel_loc = torch.zeros(
(self.num_img, 3, self.img_size, self.img_size)
)
self.pixel_loc[:, 0, :, :] = (
torch.linspace(-1, 1, self.num_img).unsqueeze(-1).unsqueeze(-1)
)
self.pixel_loc[:, 1, :, :] = (
torch.linspace(-1, 1, self.img_size).unsqueeze(0).unsqueeze(-1)
)
self.pixel_loc[:, 2, :, :] = (
torch.linspace(-1, 1, self.img_size).unsqueeze(0).unsqueeze(0)
)

It seems that lines 153 and 156 give the Cartesian coordinates of a pixel (perhaps x and y). However, line 150 does not seem to encode anything like that, because it depends only on self.num_img.

Is it a bug?

Hi @StarCycle,

It is not a bug. Those coordinates tell the network which image a particular pixel belongs to. For example, they tell whether a pixel belongs to the front image or the top image, since the images are arranged in a fixed order along the self.num_img dimension. Hence, I feel this information could be useful, although it could be the case that the network performs fine even without it.
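A minimal sketch of the snippet above (with hypothetical sizes num_img=3, img_size=4 chosen for illustration) showing that channel 0 is constant within each image but differs across images, acting as a view index, while channels 1 and 2 vary along the two pixel axes:

```python
import torch

num_img, img_size = 3, 4  # hypothetical sizes for illustration
pixel_loc = torch.zeros((num_img, 3, img_size, img_size))
# Channel 0: shape (num_img, 1, 1) broadcasts to a single constant per image,
# i.e. an identifier telling the network which camera/view a pixel came from.
pixel_loc[:, 0, :, :] = torch.linspace(-1, 1, num_img).unsqueeze(-1).unsqueeze(-1)
# Channels 1 and 2: vary along height / width, giving per-pixel position.
pixel_loc[:, 1, :, :] = torch.linspace(-1, 1, img_size).unsqueeze(0).unsqueeze(-1)
pixel_loc[:, 2, :, :] = torch.linspace(-1, 1, img_size).unsqueeze(0).unsqueeze(0)

# Channel 0 is uniform inside each image but differs between images...
assert (pixel_loc[0, 0] == -1).all() and (pixel_loc[2, 0] == 1).all()
# ...while channel 1 is constant along a row but varies down a column.
assert not torch.equal(pixel_loc[0, 1, 0, :], pixel_loc[0, 1, :, 0])
```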

Best,
Ankit

Thank you Ankit!

Oh, I get it. It could be useful! But it is still different from the original paper, which says:

"Specifically, for each view, we render three image maps with a total of 7 channels: (1) RGB (3 channels), (2) depth (1 channel), and (3) (x, y, z) coordinates of the points in the world frame (3 channels). "

If you use (x, y, z) coordinates, could you please let me know in which line they are calculated? self.pixel_loc[:, 1, :, :] and self.pixel_loc[:, 2, :, :] seem to be (x, y) coordinates in the pixel frame, not the world frame.

Zhuoheng

The channels for rendering, as mentioned in the paper, are here (RGB + D) and here (x, y, z).

The channels here are in addition to the ones above and are meant to provide position information like in other transformers.
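To illustrate how extra positional channels can sit alongside the rendered ones, here is a sketch (my own assumption about the wiring, not the repository's exact code): the 7 rendered channels per view (RGB + depth + world-frame XYZ) are concatenated with the 3 pixel_loc channels along the channel dimension before entering the network.

```python
import torch

num_img, img_size = 3, 4  # hypothetical sizes
# 7 rendered channels per view: RGB (3) + depth (1) + world-frame XYZ (3)
rendered = torch.rand(num_img, 7, img_size, img_size)
# 3 positional channels (view index + pixel x/y), as built in the snippet above
pixel_loc = torch.zeros(num_img, 3, img_size, img_size)
# Concatenate along the channel dimension: 7 + 3 = 10 channels per view
x = torch.cat((rendered, pixel_loc), dim=1)
assert x.shape == (num_img, 10, img_size, img_size)
```

This is analogous to additive or concatenated positional encodings in other transformers: the network receives location information it cannot recover from the image content alone.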

Hope it helps!

Thank you! It's impressive that PyTorch3D can also render coordinates in the world frame! Is that the reason you chose PyTorch3D instead of Open3D?

I am not familiar with Open3D, but I guess it might be possible to render coordinates with it by feeding in XYZ instead of RGB. For this project, we chose PyTorch3D because of past experience with it :)

Closing because of inactivity.