Training issues (from pretrained weights)

ToniRV opened this issue 5 years ago

First, great work!
Unfortunately, I ran into three weird behaviors when training your network.
I'm trying to fine-tune your network on the KITTI dataset, which provides a mere 200 labeled images. KITTI uses the same labels as Cityscapes. Images are 1242 pixels wide and 375 pixels high.

First, I updated train.txt and val.txt, then relabeled the images using Cityscapes' mapping from labelId to trainId. I then hit a problem with the 255 label, which I simply discard in main.py. Apart from this small inconvenience, everything works so far (it would be nice to add an --ignore_id argument to the main script).
When training, with or without pretrained weights, I would expect to have to set the image-size arguments you provide, namely 'inWidth' and 'inHeight'. However, with the values for my images, training breaks:
Running:

CUDA_VISIBLE_DEVICES=0,1 python3 main.py --batch_size 8 --s 1.0 --data_dir ./kitti --cached_data_file kitti.p --inWidth 1242 --inHeight 375

Outputs:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 72 and 71 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83

Similar Issues: #6 #13

Interestingly enough, leaving the default values makes it run...

Even more weirdly, when using the pretrained weights, I encounter the following issue:
Running:

CUDA_VISIBLE_DEVICES=0,1 python3 main.py --batch_size 10 --s 1.0 --data_dir ./kitti --cached_data_file kitti.p --pretrained ./pretrained_weights/espnetv2_segmentation_s_1.0.pth

Outputs:

Traceback (most recent call last):
  File "main.py", line 269, in <module>
    trainValidateSegmentation(parser.parse_args())
  File "main.py", line 30, in trainValidateSegmentation
    model = net.EESPNet_Seg(args.classes, s=args.s, pretrained=args.pretrained, gpus=num_gpus)
  File "segmentation/cnn/SegmentationModel.py", line 25, in __init__
    classificationNet.load_state_dict(torch.load(pretrained))
  File "venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
	Missing key(s) in state_dict: "module.level1.conv.weight", "module.level1.bn.weight", "module.level1.bn.bias", "module.level1.bn.running_mean",  ............A long list....................
	Unexpected key(s) in state_dict: "module.net.level1.conv.weight", "module.net.level1.bn.weight", "module.net.level1.bn.bias", "module.net.level1.bn.running_mean", "module.net.level1.bn.running_var", "module.net.level1.bn.num_batche, ............Another long list....................

Just as interestingly, if I run the same command with the pretrained weights for object classification instead, it runs, which is quite confusing.
Running:

CUDA_VISIBLE_DEVICES=0,1 python3 main.py --batch_size 8 --s 1.0 --data_dir ./kitti --cached_data_file kitti.p --pretrained ../imagenet/pretrained_weights/espnetv2_s_1.0.pth

Outputs:

Model initialized with pretrained weights
Total network parameters: 340782
Data statistics
[ 98.74988 102.4303   97.34298] [80.21684 78.51919 75.99867]
[ 3.5197217  7.555117   5.793313  10.019754   9.771379   9.226719
 10.181938   9.925934   3.145523   5.8163567  5.3573713 10.378751
  8.682274   6.40663   10.266428  10.30295   10.258763  10.4816475
 10.429212   7.707068 ]
Learning rate: 0.0005
Train: epoch 0
[0/23] loss: 6.070 time:6.91
[1/23] loss: 5.943 time:0.52
[2/23] loss: 5.866 time:0.48
[3/23] loss: 5.786 time:0.49
[4/23] loss: 5.752 time:0.48
[5/23] loss: 5.597 time:0.48
...

How can I fine-tune your network on the KITTI dataset starting from the pretrained segmentation weights?
The problem is that if I run the network on KITTI with the pretrained weights alone, without fine-tuning, the results do not look very good.

Sachin Mehta · Answer 1 · Wed Apr 17 2019 13:50:38 GMT+0800 (China Standard Time)

It is trivial to fix this one. We have fixed this in our new code base, which will be released soon.
Input width and height should be divisible by 16 for ESPNetv2. If they are not, then you will get this error.
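
As a rough illustration of the divisibility constraint, the sketch below snaps a requested size to a multiple of 16. The helper name `snap_to_multiple` is hypothetical and not part of the ESPNetv2 code base. Whether rounding down or up suits a given dataset is a judgment call, and the images still have to end up at the chosen size.

```python
def snap_to_multiple(size, base=16, round_up=False):
    """Round `size` down (default) or up to a multiple of `base`.

    Hypothetical helper for illustration; not part of the ESPNetv2 repository.
    """
    if size < base:
        raise ValueError(f"size {size} is smaller than base {base}")
    lower = (size // base) * base
    # Already a multiple, or the caller asked to round down.
    if lower == size or not round_up:
        return lower
    return lower + base


# Image size reported in the question: 1242 x 375.
for name, size in (("inWidth", 1242), ("inHeight", 375)):
    down = snap_to_multiple(size)
    up = snap_to_multiple(size, round_up=True)
    print(f"{name}: {size} -> {down} (down) or {up} (up)")
```

For the KITTI size in the question, this gives 1232 x 368 when rounding down and 1248 x 384 when rounding up; either pair could be tried as the --inWidth and --inHeight values.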
Model weights are saved with the DataParallel wrapper. If you try to initialize the model without the DataParallel wrapper using the pretrained weights, then you will get this error. Try wrapping the model with the DataParallel wrapper and then load the pretrained weights. It should work.
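
To see why the key names matter, here is a minimal sketch in plain Python that works on any mapping shaped like a state dict. The helpers `strip_prefix`, `add_prefix`, and `diff_keys` are hypothetical names for illustration, not functions from PyTorch or this repository. The example keys are taken from the traceback in the question; whether remapping them is the right fix for that particular checkpoint is an assumption the thread does not confirm.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Copy `state_dict`, removing `prefix` from keys that start with it."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


def add_prefix(state_dict, prefix="module."):
    """Copy `state_dict`, prepending `prefix` to every key."""
    return OrderedDict((prefix + key, value) for key, value in state_dict.items())


def diff_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, like the error message reports."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


# Placeholder values stand in for tensors; only the key names matter here.
checkpoint = OrderedDict([("module.net.level1.conv.weight", 0)])
expected = ["module.level1.conv.weight"]

missing, unexpected = diff_keys(expected, checkpoint.keys())
print("missing:", missing)
print("unexpected:", unexpected)

# Assumption: the extra "net." level is the only difference between the two.
remapped = add_prefix(strip_prefix(checkpoint, "module.net."))
print(diff_keys(expected, remapped.keys()))
```

Comparing the two key sets this way before calling load_state_dict shows whether the mismatch is only a "module." prefix (a wrapper issue) or a deeper difference in model structure.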