mks0601 / I2L-MeshNet_RELEASE

Official PyTorch implementation of "I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image", ECCV 2020

Cannot reproduce training performance

rawalkhirodkar opened this issue

Hi Gyeongsik,

I am working on reproducing the numbers reported in the paper.
Train dataset: H36M, MuCo, COCO
Test dataset: 3DPW

I am using PyTorch 1.8, Python 3.8, and CUDA 10.

I did two runs. Here is the performance of snapshot12.pth on the 3DPW dataset (the last checkpoint of the lixel stage):

1. Train Batch Size per GPU = 16, Number of GPUs = 4 (this is the default config)
   MPJPE from lixel mesh: 96.23 mm
   PA MPJPE from lixel mesh: 60.68 mm
2. Train Batch Size per GPU = 24, Number of GPUs = 8 (bigger batch config)
   MPJPE from lixel mesh: 96.37 mm
   PA MPJPE from lixel mesh: 61.51 mm
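
For reference on the metrics: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA MPJPE applies a Procrustes (similarity) alignment before measuring. A minimal sketch, assuming (J, 3) joint arrays in mm; the repo's own evaluation code may differ in details such as root-joint alignment:

import numpy as np

def mpjpe(pred, gt):
    # mean per-joint position error: pred and gt are (J, 3) arrays in mm
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def pa_mpjpe(pred, gt):
    # align pred to gt with the optimal similarity transform, then score
    p, g = pred - pred.mean(0), gt - gt.mean(0)
    U, S, Vt = np.linalg.svd(p.T @ g)   # SVD of the cross-covariance matrix
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + gt.mean(0), gt)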

I also trained the bigger batch config (run2) for the param stage.
Here is the performance of snapshot17.pth and snapshot15.pth (the best checkpoint) on the 3DPW dataset.

snapshot17.pth, param stage
MPJPE from lixel mesh: 95.85 mm
PA MPJPE from lixel mesh: 61.21 mm
MPJPE from param mesh: 98.11 mm
PA MPJPE from param mesh: 61.64 mm
snapshot15.pth, param stage
MPJPE from lixel mesh: 95.65 mm
PA MPJPE from lixel mesh: 60.97 mm
MPJPE from param mesh: 97.22 mm
PA MPJPE from param mesh: 60.82 mm

I am still waiting on the param stage of the default config and will edit this post then.
But the reported lixel MPJPE is 93.2 mm, and it looks unlikely that I will converge to that.
Any suggestions? Should I train longer?

Thank you, I would greatly appreciate your help.

You don't have to train longer. Could you let me know about any modifications you made on top of the released code?

Thank you for the reply. No modifications, right off the shelf.

I was also able to reproduce the reported results with the weights shared, so the test data setup is correct.
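
A quick way to run that sanity check is to load a released snapshot and inspect it before evaluating; a minimal sketch, where the file name and the 'network' key are assumptions about the checkpoint format rather than confirmed details:

import torch

# inspect a released snapshot before evaluation; the file name and the
# 'network' key are assumptions about the checkpoint format
ckpt = torch.load('snapshot_12.pth', map_location='cpu')
print(ckpt.keys())  # expect something like dict_keys(['epoch', 'network', 'optimizer'])
# model.load_state_dict(ckpt['network'])  # then run the usual test script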

Did you train the model with the provided SMPLify-X fits of Human3.6M, MSCOCO, and MuCo?

Yes, since I was reproducing results, I made sure everything else is identical, including the data setup. Here is the param stage train log (the lixel stage log is too big to attach here):
param_stage_train.txt

That is weird... I tried training this model a few months ago and successfully reproduced the numbers in the paper. The PA MPJPEs of your trained models are too high.

Could you check MPJPE and PA MPJPE of all snapshots of the lixel stage? It seems you only checked the last snapshot.

Thank you for the suggestion. Here is the performance of all snapshots in the lixel stage, from the latest down to earlier epochs.

snapshot 12
MPJPE from lixel mesh: 96.23 mm, PA MPJPE from lixel mesh: 60.68 mm
MPJPE from param mesh: 476.56 mm, PA MPJPE from param mesh: 312.22 mm
snapshot 11
MPJPE from lixel mesh: 96.32 mm, PA MPJPE from lixel mesh: 61.05 mm
MPJPE from param mesh: 476.17 mm, PA MPJPE from param mesh: 311.95 mm
snapshot 10
MPJPE from lixel mesh: 97.20 mm, PA MPJPE from lixel mesh: 60.99 mm
MPJPE from param mesh: 476.03 mm, PA MPJPE from param mesh: 312.18 mm
snapshot 9
MPJPE from lixel mesh: 99.54 mm, PA MPJPE from lixel mesh: 62.00 mm
MPJPE from param mesh: 475.52 mm, PA MPJPE from param mesh: 313.29 mm
snapshot 8
MPJPE from lixel mesh: 95.19 mm, PA MPJPE from lixel mesh: 59.96 mm
MPJPE from param mesh: 476.22 mm, PA MPJPE from param mesh: 312.58 mm
snapshot 7
MPJPE from lixel mesh: 100.16 mm, PA MPJPE from lixel mesh: 61.76 mm
MPJPE from param mesh: 475.57 mm, PA MPJPE from param mesh: 313.10 mm
snapshot 6
MPJPE from lixel mesh: 98.81 mm, PA MPJPE from lixel mesh: 61.52 mm
MPJPE from param mesh: 476.44 mm, PA MPJPE from param mesh: 312.91 mm
snapshot 5
MPJPE from lixel mesh: 95.52 mm, PA MPJPE from lixel mesh: 60.33 mm
MPJPE from param mesh: 475.54 mm, PA MPJPE from param mesh: 312.57 mm
snapshot 4
MPJPE from lixel mesh: 100.93 mm, PA MPJPE from lixel mesh: 61.85 mm
MPJPE from param mesh: 474.39 mm, PA MPJPE from param mesh: 314.40 mm

If this line and this line are the same as the pushed ones, I guess there would be no problem in your code. The PA MPJPEs of the snapshots are pretty weird. It seems the modules are not being trained, because the errors do not change. Could you check the results of the default setting, not the bigger batch version?

The results are for the default setting, not the bigger batch version.
I am using the old get_optimizer, bypassing trainable_modules. This should not make a difference, right?

def get_optimizer(self, model):
    if cfg.stage == 'lixel':
        optimizer = torch.optim.Adam(list(model.module.pose_backbone.parameters()) + \
                                     list(model.module.pose_net.parameters()) + \
                                     list(model.module.pose2feat.parameters()) + \
                                     list(model.module.mesh_backbone.parameters()) + \
                                     list(model.module.mesh_net.parameters()), lr=cfg.lr)
        print('The parameters of pose_backbone, pose_net, pose2feat, mesh_backbone, and mesh_net are added to the optimizer.')
    else:
        optimizer = torch.optim.Adam(model.module.param_regressor.parameters(), lr=cfg.lr)
        print('The parameters of param_regressor are added to the optimizer.')
    return optimizer

I don't think changing those lines to the newer ones would make a difference, but could you try?
If you still cannot reproduce the results, well... I can't come up with a new solution.

Thank you for the suggestion. I did a fresh clone. The current code in this repo throws the following error (one of the reasons I switched to the older version):

  File "train.py", line 83, in <module>
    main()
  File "train.py", line 40, in main
    trainer._make_model()
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 129, in _make_model
    optimizer = self.get_optimizer(model)
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 55, in get_optimizer
    optimizer = torch.optim.Adam(total_params, lr=cfg.lr)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/adam.py", line 48, in __init__
    super(Adam, self).__init__(params, defaults)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 55, in __init__
    self.add_param_group(param_group)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 255, in add_param_group
    raise TypeError("optimizer can only optimize Tensors, "
TypeError: optimizer can only optimize Tensors, but one of the params is Module.parameters
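
For context, this TypeError means the parameter list handed to Adam contains a module's bound parameters method itself rather than the Tensors it yields. The exact line in base.py at the time is not shown here, but a hedged sketch of the usual cause and fix:

import itertools
import torch
import torch.nn as nn

# stand-ins for the repo's trainable modules and cfg.lr; hypothetical, for illustration
trainable_modules = [nn.Linear(4, 4), nn.Linear(4, 2)]
lr = 1e-4

# buggy pattern behind the TypeError: collecting the bound method objects
# total_params = [m.parameters for m in trainable_modules]

# fix: call parameters() and flatten each module's generator into one iterable of Tensors
total_params = itertools.chain.from_iterable(m.parameters() for m in trainable_modules)
optimizer = torch.optim.Adam(total_params, lr=lr)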

Sorry, I just changed common/base.py. It should work now.

I am working on reproducing the numbers reported in the paper.
Train dataset: H36M, MuCo, COCO
Test dataset: 3DPW
I am using PyTorch 1.6, Python 3.7, and CUDA 10.

Here is the performance of snapshot12.pth on the 3DPW dataset (the last checkpoint of the lixel stage):
MPJPE from lixel mesh: 99.11 mm
PA MPJPE from lixel mesh: 58.80 mm
MPJPE from param mesh: nan mm
PA MPJPE from param mesh: nan mm

I am working on reproducing the results on 3DPW.
Train dataset: H36M, COCO
Test dataset: 3DPW
lr_dec_epoch = [10,12]
end_epoch = 13
lr = 1e-4
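
For reference, lr_dec_epoch typically drives a step schedule that divides the learning rate at those epochs. A minimal sketch, where the decay factor of 10 is an assumption rather than a value read from the repo's config:

# step learning-rate schedule driven by lr_dec_epoch; the decay factor of 10
# is an assumption, not a value taken from the repo
def get_lr(epoch, base_lr=1e-4, lr_dec_epoch=(10, 12), lr_dec_factor=10):
    passed = sum(epoch >= e for e in lr_dec_epoch)  # decay steps already passed
    return base_lr / (lr_dec_factor ** passed)

print([get_lr(e) for e in range(13)])  # 1e-4 up to epoch 9, 1e-5 at 10-11, 1e-6 at 12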

The performance is as follows, and it does not reach the numbers in the paper:
MPJPE from lixel mesh: 99.05 mm
PA MPJPE from lixel mesh: 62.68 mm

Should the training settings stay the same even if I use more data, such as MuCo, or should I use a different training setting?

Hi @mks0601, I thought training conventionally runs for more than 70 or 100 epochs. Why does this code train for far fewer, around 10 epochs? Thanks.

We found that longer training is not necessary.