mks0601 / I2L-MeshNet_RELEASE

Official PyTorch implementation of "I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image", ECCV 2020

Cannot reproduce training performance

rawalkhirodkar opened this issue

Hi Gyeongsik,

I am working on reproducing the numbers reported in the paper.
Train dataset: H36M, MuCo, COCO
Test dataset: 3DPW

I am using PyTorch 1.8, Python 3.8, and CUDA 10.

I did two runs. Here is the performance of snapshot12.pth on the 3DPW dataset (the last checkpoint of the lixel stage):

1. Train Batch Size per GPU = 16, Number of GPUs = 4 (this is the default config)
   MPJPE from lixel mesh: 96.23 mm
   PA MPJPE from lixel mesh: 60.68 mm
2. Train Batch Size per GPU = 24, Number of GPUs = 8 (bigger batch config)
   MPJPE from lixel mesh: 96.37 mm
   PA MPJPE from lixel mesh: 61.51 mm
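
For reference on the metrics: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA MPJPE applies a Procrustes (similarity) alignment before measuring. A minimal sketch, assuming (J, 3) joint arrays in mm; the repo's own evaluation code may differ in details such as root-joint alignment:

import numpy as np

def mpjpe(pred, gt):
    # mean per-joint position error: pred and gt are (J, 3) arrays in mm
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def pa_mpjpe(pred, gt):
    # align pred to gt with the optimal similarity transform, then score
    p, g = pred - pred.mean(0), gt - gt.mean(0)
    U, S, Vt = np.linalg.svd(p.T @ g)   # SVD of the cross-covariance matrix
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + gt.mean(0), gt)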

I also trained the bigger batch config (run2) for the param stage.
Here is the performance of snapshot17.pth and snapshot15.pth (the best checkpoint) on the 3DPW dataset.

snapshot17.pth, param stage
MPJPE from lixel mesh: 95.85 mm
PA MPJPE from lixel mesh: 61.21 mm
MPJPE from param mesh: 98.11 mm
PA MPJPE from param mesh: 61.64 mm
snapshot15.pth, param stage
MPJPE from lixel mesh: 95.65 mm
PA MPJPE from lixel mesh: 60.97 mm
MPJPE from param mesh: 97.22 mm
PA MPJPE from param mesh: 60.82 mm

I am still waiting on the param stage of the default config and will edit this post then.
But the reported lixel MPJPE is 93.2 mm, and it looks unlikely that I will converge to that.
Any suggestions? Should I train longer?

Thank you, I would greatly appreciate your help.

You don't have to train longer. Could you let me know about any modifications you made on top of the released code?

Thank you for the reply. No modifications, right off the shelf.

I was also able to reproduce the reported results with the weights shared, so the test data setup is correct.
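
A quick way to run that sanity check is to load a released snapshot and inspect it before evaluating; a minimal sketch, where the file name and the 'network' key are assumptions about the checkpoint format rather than confirmed details:

import torch

# inspect a released snapshot before evaluation; the file name and the
# 'network' key are assumptions about the checkpoint format
ckpt = torch.load('snapshot_12.pth', map_location='cpu')
print(ckpt.keys())  # expect something like dict_keys(['epoch', 'network', 'optimizer'])
# model.load_state_dict(ckpt['network'])  # then run the usual test script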

Did you train the model with the provided SMPLify-X fits of Human3.6M, MSCOCO, and MuCo?

Yes, since I was reproducing results, I made sure everything else is identical, including the data setup. Here is the param stage train log (the lixel stage log is too big to attach here):
param_stage_train.txt

That is weird... I tried training this model a few months ago and successfully reproduced the numbers in the paper. The PA MPJPEs of your trained models are too high.

Could you check MPJPE and PA MPJPE of all snapshots of the lixel stage? It seems you only checked the last snapshot.

Thank you for the suggestion. Here is the performance of all snapshots in the lixel stage, from the latest down to earlier epochs.

snapshot 12
MPJPE from lixel mesh: 96.23 mm, PA MPJPE from lixel mesh: 60.68 mm
MPJPE from param mesh: 476.56 mm, PA MPJPE from param mesh: 312.22 mm
snapshot 11
MPJPE from lixel mesh: 96.32 mm, PA MPJPE from lixel mesh: 61.05 mm
MPJPE from param mesh: 476.17 mm, PA MPJPE from param mesh: 311.95 mm
snapshot 10
MPJPE from lixel mesh: 97.20 mm, PA MPJPE from lixel mesh: 60.99 mm
MPJPE from param mesh: 476.03 mm, PA MPJPE from param mesh: 312.18 mm
snapshot 9
MPJPE from lixel mesh: 99.54 mm, PA MPJPE from lixel mesh: 62.00 mm
MPJPE from param mesh: 475.52 mm, PA MPJPE from param mesh: 313.29 mm
snapshot 8
MPJPE from lixel mesh: 95.19 mm, PA MPJPE from lixel mesh: 59.96 mm
MPJPE from param mesh: 476.22 mm, PA MPJPE from param mesh: 312.58 mm
snapshot 7
MPJPE from lixel mesh: 100.16 mm, PA MPJPE from lixel mesh: 61.76 mm
MPJPE from param mesh: 475.57 mm, PA MPJPE from param mesh: 313.10 mm
snapshot 6
MPJPE from lixel mesh: 98.81 mm, PA MPJPE from lixel mesh: 61.52 mm
MPJPE from param mesh: 476.44 mm, PA MPJPE from param mesh: 312.91 mm
snapshot 5
MPJPE from lixel mesh: 95.52 mm, PA MPJPE from lixel mesh: 60.33 mm
MPJPE from param mesh: 475.54 mm, PA MPJPE from param mesh: 312.57 mm
snapshot 4
MPJPE from lixel mesh: 100.93 mm, PA MPJPE from lixel mesh: 61.85 mm
MPJPE from param mesh: 474.39 mm, PA MPJPE from param mesh: 314.40 mm

If this line and this line are the same as the pushed ones, I guess there would be no problem in your code. The PA MPJPEs of the snapshots are pretty weird. It seems the modules are not being trained, because the errors do not change. Could you check the results of the default setting, not the bigger batch version?

The results are for the default setting, not the bigger batch version.
I am using the old get_optimizer, bypassing trainable_modules. This should not make a difference, right?

def get_optimizer(self, model):
    if cfg.stage == 'lixel':
        optimizer = torch.optim.Adam(list(model.module.pose_backbone.parameters()) + \
                                     list(model.module.pose_net.parameters()) + \
                                     list(model.module.pose2feat.parameters()) + \
                                     list(model.module.mesh_backbone.parameters()) + \
                                     list(model.module.mesh_net.parameters()), lr=cfg.lr)
        print('The parameters of pose_backbone, pose_net, pose2feat, mesh_backbone, and mesh_net are added to the optimizer.')
    else:
        optimizer = torch.optim.Adam(model.module.param_regressor.parameters(), lr=cfg.lr)
        print('The parameters of param_regressor are added to the optimizer.')
    return optimizer

I don't think changing those lines to the newer ones would make a difference, but could you try?
If you still cannot reproduce the results, well... I can't come up with a new solution.

Thank you for the suggestion. I did a fresh clone. The current code in this repo throws the following error (one of the reasons I switched to the older version):

  File "train.py", line 83, in <module>
    main()
  File "train.py", line 40, in main
    trainer._make_model()
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 129, in _make_model
    optimizer = self.get_optimizer(model)
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 55, in get_optimizer
    optimizer = torch.optim.Adam(total_params, lr=cfg.lr)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/adam.py", line 48, in __init__
    super(Adam, self).__init__(params, defaults)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 55, in __init__
    self.add_param_group(param_group)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 255, in add_param_group
    raise TypeError("optimizer can only optimize Tensors, "
TypeError: optimizer can only optimize Tensors, but one of the params is Module.parameters
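
For context, this TypeError means the parameter list handed to Adam contains a module's bound parameters method itself rather than the Tensors it yields. The exact line in base.py at the time is not shown here, but a hedged sketch of the usual cause and fix:

import itertools
import torch
import torch.nn as nn

# stand-ins for the repo's trainable modules and cfg.lr; hypothetical, for illustration
trainable_modules = [nn.Linear(4, 4), nn.Linear(4, 2)]
lr = 1e-4

# buggy pattern behind the TypeError: collecting the bound method objects
# total_params = [m.parameters for m in trainable_modules]

# fix: call parameters() and flatten each module's generator into one iterable of Tensors
total_params = itertools.chain.from_iterable(m.parameters() for m in trainable_modules)
optimizer = torch.optim.Adam(total_params, lr=lr)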

Sorry, I just changed common/base.py. It should work now.

I am working on reproducing the numbers reported in the paper.
Train dataset: H36M, MuCo, COCO
Test dataset: 3DPW
I am using PyTorch 1.6, Python 3.7, and CUDA 10.

Here is the performance of snapshot12.pth on the 3DPW dataset (the last checkpoint of the lixel stage):
MPJPE from lixel mesh: 99.11 mm
PA MPJPE from lixel mesh: 58.80 mm
MPJPE from param mesh: nan mm
PA MPJPE from param mesh: nan mm

I am working on reproducing the results on 3DPW.
Train dataset: H36M, COCO
Test dataset: 3DPW
lr_dec_epoch = [10,12]
end_epoch = 13
lr = 1e-4
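
For reference, lr_dec_epoch typically drives a step schedule that divides the learning rate at those epochs. A minimal sketch, where the decay factor of 10 is an assumption rather than a value read from the repo's config:

# step learning-rate schedule driven by lr_dec_epoch; the decay factor of 10
# is an assumption, not a value taken from the repo
def get_lr(epoch, base_lr=1e-4, lr_dec_epoch=(10, 12), lr_dec_factor=10):
    passed = sum(epoch >= e for e in lr_dec_epoch)  # decay steps already passed
    return base_lr / (lr_dec_factor ** passed)

print([get_lr(e) for e in range(13)])  # 1e-4 up to epoch 9, 1e-5 at 10-11, 1e-6 at 12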

The performance is as follows, and it does not reach the numbers in the paper:
MPJPE from lixel mesh: 99.05 mm
PA MPJPE from lixel mesh: 62.68 mm

Should the training settings stay the same even if I use more data, such as MuCo, or should I use a different training setting?

Hi @mks0601, I thought training conventionally runs for more than 70 or 100 epochs. Why does this code train for far fewer, around 10 epochs? Thanks.

We found that longer training is not necessary.