GAP-LAB-CUHK-SZ / Total3DUnderstanding

Implementation of CVPR'20 Oral: Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image


MGNet pretraining goes wrong

chengzhag opened this issue · comments

Hi Yinyu:

I tried to pretrain MGNet with python main.py configs/mgnet.yaml --mode train and test it with python main.py configs/mgnet.yaml --mode test.

However, after 50 epochs of training, the learning rate had quickly dropped to a seemingly unreasonable 1e-08, with the best chamfer_loss stuck at 5.67 since the 6th epoch.
log.txt

Also, the test results of the best checkpoint look like this:
log.txt

Is there anything I missed?

Hi,

I think you should follow our paper and train it in stages, just as TMNet (referenced in our work) does. A general strategy is to first train the AtlasNet part (by setting tmn_subnetworks=1 in mgnet.yaml). After it converges, load the weights (by setting the weight path in mgnet.yaml) and fix them to train the second stage.
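If I understand the staged setup correctly, it would look roughly like this in mgnet.yaml; apart from tmn_subnetworks, the key names and paths below are my guesses, so please check them against the shipped config:

```yaml
# Stage 1: train the AtlasNet-style deformation network alone
model:
  mesh_reconstruction:
    tmn_subnetworks: 1          # single deformation stage
weight: []                      # train from scratch

# Stage 2 (after stage 1 converges): load stage-1 weights and fix them
# model:
#   mesh_reconstruction:
#     tmn_subnetworks: 2
# weight: ['out/mgnet_stage1/model_best.pth']   # path is illustrative
```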

I have trained with tmn_subnetworks set to 1. However, when I tried to load the weights and fix them to train the second stage, I didn't find an option to fix the loaded weights.

The 'train.freeze' option seems to control which submodules are frozen, but it can't be used to fix the weights of the first stage.

I also noticed that, apart from the difference in optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings differ too (2 vs 32). Can I follow the README to reproduce the results of the paper, or do I need modifications not mentioned in the README?

Also, with tmn_subnetworks set to 1, the training and testing losses look like this:
(screenshots of the training and testing loss curves, 2020-09-27)

It looks like something is wrong with the edge, face, and boundary losses.

Hi,

Boundary loss only applies to points on open boundaries, which only appear in the second stage (tmn_subnetworks=2). So it will be 0 if tmn_subnetworks=1. The first stage does shape deformation; the second stage does topology modification.

Edge loss is a regularization term that penalizes extra-long edges. It will not change much during training.

Face loss classifies whether a point on an edge/face should be removed.
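For intuition, the boundary and edge terms could be sketched roughly like this; this is illustrative numpy with made-up names, not the repo's actual loss code:

```python
import numpy as np

def edge_loss(verts, edges):
    """Penalize extra-long edges: mean squared edge length.

    verts: (V, 3) vertex positions; edges: (E, 2) vertex index pairs.
    """
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return ((v0 - v1) ** 2).sum(-1).mean()

def boundary_loss(dists_to_boundary, on_open_boundary):
    """Only points on open boundaries contribute. With tmn_subnetworks=1
    no faces are removed, the mask is all False, and the loss is 0."""
    if not on_open_boundary.any():
        return 0.0
    return dists_to_boundary[on_open_boundary].mean()
```

This is why the boundary loss curve sits at 0 in the stage-1 plots above: the open-boundary mask is empty until topology modification kicks in.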

We will update our README with more details after our deadline ends. Here is our training strategy; you can also follow the strategy in this work:

We first set 'tmn_subnetworks=1' and turn off the edge classifier by setting 'with_edge_classifier=False' in config.yaml for training (this is equivalent to AtlasNet). After it converges, set 'with_edge_classifier=True' to train the edge classifier in the first stage. These are the modules of the first stage.

After that, we fix the above modules and train the second-stage decoder using this function. You can add the line
self.mesh_reconstruction.module.freeze_by_stage(2, ['decoder'])
at this place, and remember to set 'with_edge_classifier=True' and 'tmn_subnetworks=2'.
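In case it helps others, the effect of a freeze call like that can be sketched generically in PyTorch; the module names below are made up for illustration and this is not the repo's actual freeze_by_stage implementation:

```python
import torch.nn as nn

def freeze_submodules(model: nn.Module, names):
    """Fix the named child modules: stop gradient updates and
    put them in eval mode so BatchNorm statistics stay frozen too."""
    for name, module in model.named_children():
        if name in names:
            module.eval()
            for p in module.parameters():
                p.requires_grad = False

class TwoStageNet(nn.Module):
    """Toy stand-in for a model with stage-1 and stage-2 parts."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(4, 4)   # pretrained, to be fixed
        self.stage2 = nn.Linear(4, 4)   # still training

model = TwoStageNet()
freeze_submodules(model, ['stage1'])
```

Remember to also filter the frozen parameters out of the optimizer (or build the optimizer from only the trainable parameters), otherwise weight decay can still move them.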

Thanks a lot for your patience and detailed explanation! I'll try the steps and refer to the work.

Hi Yinyu:
I followed the three steps (MGN1, MGN2, MGN3) and got the following results:
(screenshot of the training results)
It seems that the third step didn't improve the chamfer loss at all. Where did I go wrong?

The test Avg_Chamfer of stage 3 is 9.70, not as good as the 8.36 in your paper or the 8.14 of the downloaded MGNet checkpoint.

Another question: is the Avg_Chamfer reported by your test code the same metric as in your paper? The paper mentions that an ICP algorithm is applied to the output, which I don't see in the code.
(screenshot of the relevant passage from the paper)
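For anyone comparing numbers, the ICP-then-chamfer evaluation the paper describes could be sketched in plain numpy as below; the function names and the brute-force nearest-neighbor search are my own assumptions, not the authors' evaluation code:

```python
import numpy as np

def rigid_align(src, dst):
    """Best-fit rotation/translation mapping src onto dst (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(src, dst, iters=20):
    """Alternate nearest-neighbor matching and rigid re-fitting."""
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbor in dst for each point in cur
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        R, t = rigid_align(cur, dst[d2.argmin(1)])
        cur = cur @ R.T + t
    return cur

def chamfer(a, b):
    """Symmetric chamfer distance between two point clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()
```

If the paper's number is computed after `icp(pred_points, gt_points)` and the released test code skips that step, the two Avg_Chamfer values wouldn't be directly comparable.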

I tried another run. The learning rate of the first step happened to start decaying about 30 epochs later, which resulted in a better chamfer score after the first step:
(screenshot of the results)

However, the test chamfer became worse after the third step, which is strange. The best chamfer I got is 9.07, after the second step of training. This is still not very close to your results.

May I get some more tips about the training process? Is there something wrong with my procedure?

@pidan1231239 Sorry for asking an unrelated question, but how did you visualize the training process? Is this written in the source code?

I used Weights & Biases and added a few lines of code.
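For anyone wondering, those few lines amount to calling wandb.log on the metrics each epoch. Here's a runnable stand-in that records to a list instead, so the sketch works without the package; swap `log` for `wandb.log` (after `wandb.init(...)`) in practice:

```python
history = []

def log(metrics, step):
    # real code would call: wandb.log(metrics, step=step)
    history.append({'step': step, **metrics})

# inside the training loop, after each epoch:
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)          # placeholder metric value
    log({'train/chamfer_loss': train_loss}, step=epoch)
```

The W&B dashboard then plots each logged key against the step axis, which is how the loss curves in the screenshots above were produced.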

Thank you for your fast reply! I will also give it a try

No problem!

@pidan1231239 Hi, have you reproduced the results reported in the paper? I'm also trying to, but only got 0.103016 (average chamfer distance).

The downloaded checkpoint achieves a Chamfer loss of 0.008187, which is before ICP alignment; the released code is probably not the exact code used for the final evaluation.

In my best run, the loss got down to 0.01028, with the batch size changed to 32 as in the paper. However, I used two GPUs for the second stage and one for the others because of memory limits. I don't know if I missed something.

> Noticed that apart from the difference of optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings are different too (2 vs 32). Can I go through README to reproduce the results of the paper or do I need another modification not mentioned in README?

Hi, the author didn't reply to this question. I am also curious whether we should follow the batch size and learning rate from the paper or from this GitHub repo. The batch size, lr and epoch number are all different.