GAP-LAB-CUHK-SZ / Total3DUnderstanding

Implementation of CVPR'20 Oral: Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image


MGNet pretraining goes wrong

chengzhag opened this issue · comments

Hi Yinyu:

I tried to pretrain MGNet with python main.py configs/mgnet.yaml --mode train and test it with python main.py configs/mgnet.yaml --mode test.

However, after 50 epochs of training, the learning rate had quickly dropped to a seemingly unreasonable 1e-08, with the best chamfer_loss stuck at 5.67 since the 6th epoch.
log.txt

Also, the test results of the best checkpoint look like this:
log.txt

Is there anything I missed?

Hi,

I think you should follow our paper and train it in stages, just as TMNet (referenced in our work) does. A general strategy is to first train the AtlasNet part (by setting tmn_subnetworks=1 in mgnet.yaml). After it converges, load the weights (by setting the weight path in mgnet.yaml) and fix them to train the second stage.
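If I understand the staged setup correctly, it would look roughly like this in mgnet.yaml; apart from tmn_subnetworks, the key names and paths below are my guesses, so please check them against the shipped config:

```yaml
# Stage 1: train the AtlasNet-style deformation network alone
model:
  mesh_reconstruction:
    tmn_subnetworks: 1          # single deformation stage
weight: []                      # train from scratch

# Stage 2 (after stage 1 converges): load stage-1 weights and fix them
# model:
#   mesh_reconstruction:
#     tmn_subnetworks: 2
# weight: ['out/mgnet_stage1/model_best.pth']   # path is illustrative
```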

I have trained with tmn_subnetworks set to 1. However, when I tried to load the weights and fix them to train the second stage, I didn't find an option to fix the loaded weights.

The 'train.freeze' option seems to control which submodules are frozen, but it can't be used to fix the weights of the first stage.

I also noticed that, apart from the difference in optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings differ too (2 vs 32). Can I follow the README to reproduce the results of the paper, or do I need modifications not mentioned in the README?

Also, with tmn_subnetworks set to 1, the training and testing losses look like this:
(screenshots of the training and testing loss curves, 2020-09-27)

It looks like something is wrong with the edge, face, and boundary losses.

Hi,

Boundary loss only applies to points on open boundaries, which only appear in the second stage (tmn_subnetworks=2). So it will be 0 if tmn_subnetworks=1. The first stage does shape deformation; the second stage does topology modification.

Edge loss is a regularization term that penalizes extra-long edges. It will not change much during training.

Face loss classifies whether a point on an edge/face should be removed.
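For intuition, the boundary and edge terms could be sketched roughly like this; this is illustrative numpy with made-up names, not the repo's actual loss code:

```python
import numpy as np

def edge_loss(verts, edges):
    """Penalize extra-long edges: mean squared edge length.

    verts: (V, 3) vertex positions; edges: (E, 2) vertex index pairs.
    """
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return ((v0 - v1) ** 2).sum(-1).mean()

def boundary_loss(dists_to_boundary, on_open_boundary):
    """Only points on open boundaries contribute. With tmn_subnetworks=1
    no faces are removed, the mask is all False, and the loss is 0."""
    if not on_open_boundary.any():
        return 0.0
    return dists_to_boundary[on_open_boundary].mean()
```

This is why the boundary loss curve sits at 0 in the stage-1 plots above: the open-boundary mask is empty until topology modification kicks in.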

We will update our README with more details after our deadline ends. Here is our training strategy; you can also follow the strategy in this work:

We first set 'tmn_subnetworks=1' and turn off the edge classifier by setting 'with_edge_classifier=False' in config.yaml for training (this is equivalent to AtlasNet). After it converges, set 'with_edge_classifier=True' to train the edge classifier in the first stage. These are the modules of the first stage.

After that, we fix the above modules and train the second-stage decoder using this function. You can add the line
self.mesh_reconstruction.module.freeze_by_stage(2, ['decoder'])
at this place, and remember to set 'with_edge_classifier=True' and 'tmn_subnetworks=2'.
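In case it helps others, the effect of a freeze call like that can be sketched generically in PyTorch; the module names below are made up for illustration and this is not the repo's actual freeze_by_stage implementation:

```python
import torch.nn as nn

def freeze_submodules(model: nn.Module, names):
    """Fix the named child modules: stop gradient updates and
    put them in eval mode so BatchNorm statistics stay frozen too."""
    for name, module in model.named_children():
        if name in names:
            module.eval()
            for p in module.parameters():
                p.requires_grad = False

class TwoStageNet(nn.Module):
    """Toy stand-in for a model with stage-1 and stage-2 parts."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(4, 4)   # pretrained, to be fixed
        self.stage2 = nn.Linear(4, 4)   # still training

model = TwoStageNet()
freeze_submodules(model, ['stage1'])
```

Remember to also filter the frozen parameters out of the optimizer (or build the optimizer from only the trainable parameters), otherwise weight decay can still move them.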

Thanks a lot for your patience and detailed explanation! I'll try the steps and refer to the work.

Hi Yinyu:
I followed the three steps (MGN1, MGN2, MGN3) and got the following results:
(screenshot of the training results)
It seems that the third step didn't improve the chamfer loss at all. Where did I go wrong?

The test Avg_Chamfer of stage 3 is 9.70, not as good as the 8.36 in your paper or the 8.14 of the downloaded MGNet checkpoint.

Another question: is the Avg_Chamfer reported by your test code the same metric as in your paper? The paper mentions that an ICP algorithm is applied to the output, which I don't see in the code.
(screenshot of the relevant passage from the paper)
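For anyone comparing numbers, the ICP-then-chamfer evaluation the paper describes could be sketched in plain numpy as below; the function names and the brute-force nearest-neighbor search are my own assumptions, not the authors' evaluation code:

```python
import numpy as np

def rigid_align(src, dst):
    """Best-fit rotation/translation mapping src onto dst (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(src, dst, iters=20):
    """Alternate nearest-neighbor matching and rigid re-fitting."""
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbor in dst for each point in cur
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        R, t = rigid_align(cur, dst[d2.argmin(1)])
        cur = cur @ R.T + t
    return cur

def chamfer(a, b):
    """Symmetric chamfer distance between two point clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()
```

If the paper's number is computed after `icp(pred_points, gt_points)` and the released test code skips that step, the two Avg_Chamfer values wouldn't be directly comparable.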

I tried another run. The learning rate of the first step happened to start decaying about 30 epochs later, which resulted in a better chamfer score after the first step:
(screenshot of the results)

However, the test chamfer became worse after the third step, which is strange. The best chamfer I got is 9.07, after the second step of training. This is still not very close to your results.

May I get some more tips about the training process? Is there something wrong with my procedure?

@pidan1231239 Sorry for asking an unrelated question, but how did you visualize the training process? Is this written in the source code?

I used Weights & Biases and added a few lines of code.
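For anyone wondering, those few lines amount to calling wandb.log on the metrics each epoch. Here's a runnable stand-in that records to a list instead, so the sketch works without the package; swap `log` for `wandb.log` (after `wandb.init(...)`) in practice:

```python
history = []

def log(metrics, step):
    # real code would call: wandb.log(metrics, step=step)
    history.append({'step': step, **metrics})

# inside the training loop, after each epoch:
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)          # placeholder metric value
    log({'train/chamfer_loss': train_loss}, step=epoch)
```

The W&B dashboard then plots each logged key against the step axis, which is how the loss curves in the screenshots above were produced.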

Thank you for your fast reply! I will also give it a try

No problem!

@pidan1231239 Hi, have you reproduced the results reported in the paper? I'm also trying to, but only got 0.103016 (average chamfer distance).

The downloaded checkpoint achieves a Chamfer loss of 0.008187, which is before ICP alignment; the released code is probably not the exact code used for the final evaluation.

In my best run, the loss got down to 0.01028, with the batch size changed to 32 as in the paper. However, I used two GPUs for the second stage and one for the others because of memory limits. I don't know if I missed something.

> Noticed that apart from the difference of optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings are different too (2 vs 32). Can I go through README to reproduce the results of the paper or do I need another modification not mentioned in README?

Hi, the author didn't reply to this question. I am also curious whether we should follow the batch size and learning rate from the paper or from this GitHub repo. The batch size, lr and epoch number are all different.