NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering


Wrong scene rasterization

csyhping opened this issue

Hi @s-laine , I tried your earlier suggestion about rendering multiple objects, but I still can't get the correct result. Could you please help me with this? I'm not sure where it goes wrong.

Do I understand correctly that you're trying to reconstruct a 3D mesh for an isolated object from a single image? I don't think that's going to be possible; the task is much too ambiguous unless you had extremely strong, differentiable priors to penalize solutions that should be considered invalid. This kind of reconstruction benefits greatly from having a large set of reference images with known camera positions and segmented-out backgrounds, and it is still not easy. See nvdiffrec for one such solution.

@s-laine , no, I'm not trying to reconstruct a 3D mesh; the image is just for reference/explanation. I want to render an image from the scene mesh (I already have the mesh; I do not need to reconstruct one), and I expect the render result to look like the GT image, with the chair in front of the wall.

@s-laine , sorry for the confusion; let me clarify a little. My goal is to render an image from the scene mesh (the 23scene.zip), which I already have. The GT image has nothing to do with this; it is just a reference that tells me whether my rendering is correct. I simply want a correct rendering of the scene (the chair is indeed in front of the wall in the scene mesh).

Ok, thanks for the clarification. The problem is the camera model, or rather, the lack of one. You are dividing the vertices' x and y coordinates by z, and then replacing all w coordinates with 1.0 without applying any projection. This turns the model into a flat plane, and the depth test in the rasterizer cannot distinguish which triangles should be in front and which behind.
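For illustration only, and as a guess at the shape of your transform rather than your actual code, a minimal sketch of the failure mode (assuming pos holds camera-space vertices with positive z):

import torch

pos = torch.rand(100, 3) + 0.5                # hypothetical camera-space vertices, z > 0
ndc = pos / pos[:, 2:3]                       # dividing everything by z makes z/z == 1 for every vertex
pos_clip = torch.cat([ndc, torch.ones_like(ndc[:, :1])], dim=1).unsqueeze(0)  # w forced to 1.0
# Every vertex now ends up with the same depth, so the depth test has nothing
# to order the triangles by; the whole scene behaves like a flat plane.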

I don't understand how the camera is supposed to work, but your code suggests that the vertices' z coordinate is actually the post-projection w coordinate, and that the correct pixel positions are obtained by simply dividing x and y by it. Here's a hacky way to reconstruct a z coordinate that can be used for the depth test:

# Transform the vertex positions by the camera matrix K, then by N into NDC space.
vert_cam = torch.mm(pos, K.t())
vert_ndc_new = torch.mm(vert_cam, N.t())
# Append z^2 as an extra column, reorder so the squared z becomes the clip-space z
# and the z coordinate ends up in the w slot, then add a batch dimension.
pos = torch.cat([vert_ndc_new, pos[:, 2:3].square()], axis=1)[:, [0, 1, 3, 2]].unsqueeze(0)

Here the post-perspective w coordinate is taken to be just the z coordinate of the input, and a new z coordinate is created by squaring that value, so that after the perspective divide the depth values end up in the same range they had in the mesh. If this falls outside the -1 ... +1 range, there will be depth clipping, and you need to scale the values accordingly.
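If scaling is needed, one option (not from the snippet above; the helper name and the z_min/z_max bounds are assumptions) is to replace the squaring with a linear remap so the post-divide depth always lands in -1 ... +1. The x and y coordinates would still go through the K and N transforms as before; this only rebuilds the depth channel:

import torch

def depth_rescaled(pos, z_min, z_max):
    # pos: (N, 3) camera-space vertices with z > 0; z_min/z_max: assumed scene depth bounds.
    z = pos[:, 2:3]
    depth_ndc = 2.0 * (z - z_min) / (z_max - z_min) - 1.0   # map z linearly into [-1, 1]
    z_clip = depth_ndc * z        # pre-multiply by w (= z) so the perspective divide cancels it
    return torch.cat([pos[:, 0:2], z_clip, z], dim=1)       # columns: x, y, z_clip, w = z
    # (add .unsqueeze(0) for the batch dimension before rasterizing, as in the snippet above)

The squaring trick is the special case depth_ndc = z, which only stays inside the clip range when the scene depths already do.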

The proper way to do this would be to append a w = 1.0 coordinate to the input vertex positions and multiply them by a 4×4 perspective projection matrix (plus any other necessary transformations), producing the homogeneous vertex positions that nvdiffrast expects as input. But this requires knowing what the view frustum is supposed to be, and here that seems to be baked into the vertex positions of the model.
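Roughly, and only as a sketch (the near/far values and helper names are assumptions; the matrix is a standard OpenGL-style perspective in the spirit of the projection() helper in samples/torch/util.py, i.e. a camera looking down the negative z axis):

import torch

def perspective(x=0.1, n=1.0, f=50.0):
    # Frustum half-width x at the near plane n, far plane f; x == n gives a 90-degree frustum.
    # (util.py's projection() is similar but may differ in, e.g., the y-axis sign convention.)
    return torch.tensor([[n/x,  0.0,           0.0,            0.0],
                         [0.0,  n/x,           0.0,            0.0],
                         [0.0,  0.0, -(f+n)/(f-n), -(2*f*n)/(f-n)],
                         [0.0,  0.0,          -1.0,            0.0]])

def to_clip_space(pos, mtx):
    # pos: (N, 3) vertices, mtx: (4, 4) combined model-view-projection matrix.
    ones = torch.ones_like(pos[:, :1])
    pos_h = torch.cat([pos, ones], dim=1)          # append w = 1.0
    return torch.mm(pos_h, mtx.t()).unsqueeze(0)   # homogeneous clip positions for dr.rasterize()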

@s-laine , thanks for your reply. I tested your code, and yes, I get the correct rendering now!

I agree with your suggestion about creating homogeneous vertex positions, and your code confirms that my own transform is what's wrong here. May I ask you a little more about this?

In my scene, the camera is at [0, 0, 0], so the scene mesh is already in camera coordinates (which is a little tricky). I've read the examples about the projection matrix in util.py, but I only have the camera's K = [cx, cy, fx, fy], so what I wanted to do is divide by z, then transform by K, and then go to clip space.

So, how should I modify my code to get the correct transform this way?

And what if the camera is not at [0, 0, 0]?

If dividing by z is the correct thing to do, that corresponds to a camera whose frustum opens at a 90-degree angle. This is the same as setting x equal to n in the projection() function in samples/torch/util.py. If K is a 2D post-projective transformation into the image, you can formulate it as a 4×4 matrix and multiply by it after the perspective transformation.
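Only as a sketch (the exact convention of K, pixel vs. normalized image coordinates, is an assumption here), the post-projective K could look like this, reusing the perspective()/to_clip_space() sketch from earlier with x = n for the 90-degree frustum:

import torch

def intrinsics_4x4(fx, fy, cx, cy):
    # 2D post-projective transform as a 4x4 matrix: the offsets sit in the w column,
    # so after the perspective divide the result is fx * x_ndc + cx (and likewise for y).
    return torch.tensor([[fx,  0.0, 0.0,  cx],
                         [0.0,  fy, 0.0,  cy],
                         [0.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0, 1.0]])

# Hypothetical usage:
# proj = perspective(x=1.0, n=1.0, f=50.0)                 # 90-degree frustum (x == n)
# mtx = torch.mm(intrinsics_4x4(fx, fy, cx, cy), proj)     # apply K after the projection
# pos_clip = to_clip_space(pos, mtx)                       # feed to dr.rasterize()
# If K produces pixel coordinates, a final pixels-to-NDC scale (the role of N in
# your earlier snippet) is still needed so everything lands in the -1 ... +1 range.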

Got it! Thank you soooooo much!!!!