ShenhanQian / GaussianAvatars

[CVPR 2024 Highlight] The official repo for "GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians"

Home Page: https://shenhanqian.github.io/gaussian-avatars

Why are the pre-processed camera extrinsic parameters different from those in the original NeRSemble dataset?

zydmu123 opened this issue

I'd like to know whether any special adjustment is applied to the cameras. Thanks a lot!

Hi, the extrinsics of the raw NeRSemble dataset are camera-to-world matrices in the OpenCV convention, obtained from COLMAP.

Before FLAME tracking, we convert the extrinsics into camera-to-world matrices in the OpenGL convention, the same convention used by most synthetic NeRF datasets.
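
For reference, converting a camera-to-world matrix from the OpenCV to the OpenGL camera convention amounts to flipping the camera's y and z axes. A minimal sketch (the variable names here are only for illustration):

import torch

# c2w_cv: a (4, 4) camera-to-world matrix in the OpenCV convention (x right, y down, z forward)
c2w_cv = torch.eye(4)  # placeholder; in practice this comes from the dataset

# Flipping the camera's y and z basis vectors yields the OpenGL convention (x right, y up, z backward);
# the camera position (last column) stays unchanged.
c2w_gl = c2w_cv.clone()
c2w_gl[:3, 1:3] *= -1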

We also apply a global rotation to all cameras to align their mean pose with the world coordinate axes, as in the function below.

from typing import Literal, Optional

import torch


def align_cameras_to_axes(
    R: torch.Tensor,
    T: torch.Tensor,
    target_convention: Optional[Literal["opengl", "opencv"]] = None,
):
    """Align the averaged axes of the cameras with the world axes.

    Args:
        R: rotation matrices (N, 3, 3)
        T: translation vectors (N, 3, 1)
    """
    # The column vectors of R are the basis vectors of each camera.
    # We construct new bases by taking the mean directions of the axes, then use the
    # Gram-Schmidt process to make them orthonormal.
    bases_c2w = gram_schmidt_orthogonalization(R.mean(0))
    if target_convention == "opengl":
        bases_c2w[:, [1, 2]] *= -1  # flip y and z axes
    elif target_convention == "opencv":
        pass
    bases_w2c = bases_c2w.t()

    # convert the camera poses into the new coordinate system
    R = bases_w2c[None, ...] @ R
    T = bases_w2c[None, ...] @ T  # T as a column vector so the batched matmul broadcasts
    return R, T
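
Here, gram_schmidt_orthogonalization appears to be a helper defined elsewhere in the tracking code; all it needs to do is orthonormalize the three averaged basis vectors. A minimal sketch of such a helper (not necessarily the repo's exact implementation):

import torch

def gram_schmidt_orthogonalization(M: torch.Tensor) -> torch.Tensor:
    """Orthonormalize the columns of a 3x3 matrix via Gram-Schmidt."""
    cols = []
    for i in range(3):
        v = M[:, i].clone()
        for u in cols:
            v = v - (v @ u) * u  # remove the component along already-fixed axes
        cols.append(v / v.norm())
    return torch.stack(cols, dim=1)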

After we get FLAME tracking results, we add a global translation to all cameras and the FLAME mesh so that the mean position of the head in each sequence is at the origin.
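
Conceptually, that recentering step boils down to subtracting the mean head position from both the mesh and the camera positions. A rough sketch with made-up names and shapes (flame_verts, cam_positions), not the actual pipeline code:

import torch

# flame_verts: (T, V, 3) tracked FLAME vertices over a sequence (hypothetical name/shape)
# cam_positions: (N, 3) camera centers in world space (hypothetical name/shape)
flame_verts = torch.rand(100, 5023, 3)
cam_positions = torch.rand(16, 3)

head_center = flame_verts.mean(dim=(0, 1))   # mean head position over the whole sequence
flame_verts = flame_verts - head_center      # mesh recentered at the origin
cam_positions = cam_positions - head_center  # cameras translated by the same offset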

Thanks for your kind reply, @ShenhanQian! As you mentioned above, the pre-processed camera extrinsic parameters are already matched with the FLAME mesh after the tracking process. However, there still seems to be a slight issue when converting your pre-processed camera parameters into PyTorch3D's format. Here is my test: with the code in the screenshot below, I can't get properly matched results through a PerspectiveCameras object. Did I miss something important?
[attached screenshot: GS_t]

For your reference, here is a code snippet that works with PyTorch3D on our side:

import numpy as np
import torch

# `frame` is one camera entry from the NeRF-style transforms file
# (keys: transform_matrix, fl_x, fl_y, cx, cy, w, h)

# construct the world-to-camera (extrinsic) matrix
c2w = torch.tensor(frame['transform_matrix'])
c2w[:3, [0, 2]] *= -1  # OpenGL to PyTorch3D (flip x and z axes)
w2c = torch.inverse(c2w).float()
w2c[:3, :3] = w2c.clone()[:3, :3].T  # PyTorch3D uses x = XR + t, while OpenGL uses x = RX + t
self.data["world_mats"].append(w2c)

# construct the intrinsic matrix in PyTorch3D's NDC convention
intrinsics = np.zeros((4, 4))
intrinsics[0, 0] = frame['fl_x'] / frame['w'] * 2
# intrinsics[1, 1] = frame['fl_y'] / frame['h'] * 2
intrinsics[1, 1] = frame['fl_y'] / frame['w'] * 2  # NOTE: the NDC space is a cube, so we use the same scale for x and y
intrinsics[0, 2] = -(frame['cx'] / frame['w'] * 2 - 1)
intrinsics[1, 2] = -(frame['cy'] / frame['h'] * 2 - 1)
intrinsics[3, 2] = 1.
intrinsics[2, 3] = 1.
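
If it helps, the values above could then be plugged into PyTorch3D roughly like this (a sketch assuming a single camera and the default NDC-space PerspectiveCameras constructor; not the exact code from either side):

import torch
from pytorch3d.renderer import PerspectiveCameras

R = w2c[:3, :3][None]  # (1, 3, 3), already transposed for PyTorch3D's x = XR + t
T = w2c[:3, 3][None]   # (1, 3)
focal = torch.tensor([[intrinsics[0, 0], intrinsics[1, 1]]], dtype=torch.float32)
pp = torch.tensor([[intrinsics[0, 2], intrinsics[1, 2]]], dtype=torch.float32)

cameras = PerspectiveCameras(focal_length=focal, principal_point=pp, R=R, T=T)  # in_ndc=True by default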

It works. Thanks a lot!

I used the MetaHuman model. Is it necessary to place the model at the origin?