rabbityl / lepard

[CVPR 2022, Oral] Learning Partial point cloud matching in Rigid and Deformable scenes

Some questions about the hyperparameters

qsisi opened this issue · comments

commented

Hello! Here are some questions for the code part on dataset 3DMatch.

  1. The dimensionality of the output from the KPConv backbone is set to 528. In theory, any number divisible by both 6 and 4 should be feasible here, since, as I understand it, divisibility by 6 is needed for the rotary positional embedding and divisibility by 4 for the multi-head attention. So why not choose 516 as the output dimensionality? Is there a reason behind this choice?
  2. Your implementation of the rotary positional embedding is impressive. I noticed that you first voxelize the raw 3D coordinates (without flooring the output) to scale them. I believe vol_bnds = [-3.6, -2.4, 1.14] is the minimum coordinate over the whole 3DMatch train/val/test set? Also, does voxelizing (i.e., scaling) the coordinates before the positional embedding give better results than just using the raw 3D coordinates? Could you give some hints about this?

Thank you very much for your help.

Hi.

  1. The reason is that we think position codes with the same frequency should fall into the same head of the transformer; therefore the feature dimension needs to be divisible by 24 (4 * 6).
  2. The voxel size controls the starting frequencies of the position code. Lower frequencies lead to smoother signals that can reflect long-range distance, and higher frequencies lead to fluctuating signals that reflect short-range distance.
    Using the raw coordinates leads to frequencies that are too high. Fig. 2 in this paper is intuitive: https://arxiv.org/pdf/2104.06405.pdf
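To make the divisibility constraint concrete, here is a minimal NumPy sketch (illustrative only, not the repo's actual code) of how a 3D rotary table could split channels per head, then per axis, then into (sin, cos) pairs, assuming standard transformer-style inverse frequencies with base 10000:

```python
import numpy as np

def rope_3d_table(coords, feat_dim=528, num_heads=4):
    """Per-head, per-axis (sin, cos) tables for a 3D rotary embedding.

    Channels are split across num_heads heads, then across the 3 axes,
    then into (sin, cos) pairs, so feat_dim must be divisible by
    num_heads * 3 * 2 = 24.  528 = 24 * 22 works; 516 = 24 * 21.5 does not.
    """
    assert feat_dim % (num_heads * 6) == 0, "feat_dim must be divisible by 24"
    pairs = feat_dim // (num_heads * 6)      # sin/cos pairs per axis per head
    # Assumed transformer-style inverse frequencies (base 10000).
    inv_freq = 1.0 / (10000.0 ** (np.arange(pairs) / pairs))
    # (N, 3) coords -> (N, 3, pairs) phase angles, one set per axis.
    phase = coords[:, :, None] * inv_freq[None, None, :]
    return np.sin(phase), np.cos(phase)

pts = np.random.rand(100, 3) / 0.08          # coords scaled by the voxel size
sin_t, cos_t = rope_3d_table(pts)
print(sin_t.shape)                           # (100, 3, 22)
```

With feat_dim=516 the assert fires, which matches the answer above: 516 is divisible by 6 and by 4 separately, but not by 24.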
commented

Thank you for your prompt reply.

Sorry to check again: do you mean that by applying voxelization, the voxelized coordinates could have a higher frequency, resulting in smoother signals?

Also, does vol_bnds = [-3.6, -2.4, 1.14] denote the minimum coordinates over the whole 3DMatch train/val/test set?

Thanks.

Sorry, I phrased that wrongly. Voxelization with 0.04 m leads to lower frequencies and smoother signals.

Yes, it is the min coordinate.
The vol_bnds is meant to cancel the global translation. This is not necessary for rotary positional encoding, as it always reflects relative distance, but it could affect absolute encodings such as the sinusoidal one.
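The translation-invariance point can be checked numerically. Below is a small self-contained sketch (illustrative, not lepard's code) showing that for one rotary (sin, cos) channel pair, the attention dot product depends only on the relative offset between two positions, so a global shift such as subtracting vol_bnds changes nothing:

```python
import numpy as np

def rotate(v, angle):
    """Apply a 2D rotation to one (even, odd) channel pair, as in RoPE."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

omega = 0.7                            # one frequency channel (arbitrary value)
q = np.array([0.3, -1.2])              # query features for this channel pair
k = np.array([0.8, 0.5])               # key features for this channel pair

scores = []
for shift in [0.0, 5.0, -3.6]:         # global translations of the whole scene
    x, y = 1.25 + shift, 4.00 + shift  # two point positions, same relative offset
    scores.append(rotate(q, omega * x) @ rotate(k, omega * y))

# All three scores agree up to floating-point error, because
# R(wx)q . R(wy)k = q . R(w(y - x))k depends only on y - x.
print(np.allclose(scores, scores[0]))  # True
```

An absolute sinusoidal encoding added to the features would not have this property, which is why vol_bnds would matter there.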

commented

Thanks.

Now I get your point: by voxelizing, raw coordinates such as [0.3900, 0.9669, 0.7839] are scaled to [0.3900, 0.9669, 0.7839] / 0.08 = [4.875, 12.08625, 9.79875], which corresponds to lower frequencies, leading to smoother signals.

Also, the voxel_size setting for 3DMatch seems to be 0.08 m instead of 0.04 m?
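For reference, the scaling arithmetic above can be reproduced in a few lines (using voxel_size = 0.08 as discussed; subtracting vol_bnds first only shifts the origin, which, as noted earlier, does not affect the relative encoding):

```python
import numpy as np

voxel_size = 0.08
vol_bnds = np.array([-3.6, -2.4, 1.14])

raw = np.array([0.3900, 0.9669, 0.7839])
scaled = raw / voxel_size                # plain scaling, no flooring
print(scaled)                            # [ 4.875   12.08625  9.79875]

# Subtracting vol_bnds before scaling only adds a constant offset,
# so relative distances between points are unchanged:
shifted = (raw - vol_bnds) / voxel_size
print(shifted - scaled)                  # constant offset = -vol_bnds / voxel_size
```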

commented

Sorry to bother you again. May I ask how you obtained the exact vol_bnds = [-3.6, -2.4, 1.14] for 3DMatch? When I iterate through the train/val/test sets of 3DMatch (provided by PREDATOR), their minimum coordinates turn out to be [-1.5, -1.5, 0.5]. Could you give some hints about this?

commented

I remember [-3.6, -2.4, 1.14] was from 4DMatch.
Our positional encoding is relative, i.e., changing the starting point does not affect the position encoding.
Therefore, vol_bnds is not a crucial parameter; you can use any number for it.

commented

Thanks for your reply. Indeed, those bounds are not crucial for relative positional encoding. I am currently trying a sparse convolution library that needs the minimum coordinates over all input point clouds, so the fact that the minimum coordinates I computed were inconsistent with those in lepard confused me. Now it makes sense. Anyway, thanks again for your reply.