hualin95 / Deeplab-v3plus

A higher-performance PyTorch implementation of DeepLab V3 Plus (DeepLab v3+)


Fail to reproduce the reported result.

minghchen opened this issue · comments

Nice work! I tested on 2 GPUs (1080 Ti) with a batch size of 4 on each. That should be equivalent to 4 GPUs with 2 per GPU, because of sync_bn.
I achieve 75.8% mIoU at stage1 and 76.8% at stage2, which is lower than 79%.
I tried the newest code with the following commands, since voc2012_train.sh doesn't match the updated code.

python3 tools/train_voc.py \
                           --gpu "1,2" \
                           --dataset 'voc2012_aug' \
                           --checkpoint_dir "./log/voc_aug_ba=8_wd=4e-5_iter_max=30k/" \
                           --freeze_bn False \
                           --weight_decay 4e-5 \
                           --lr 0.007 \
                           --output_stride 16 \
                           --iter_max 30000 \
                           --batch_size_per_gpu 4
python3 tools/train_voc.py \
                           --gpu "1,2" \
                           --dataset 'voc2012_aug' \
                           --checkpoint_dir "./log/voc_aug_ba=8_wd=4e-5_iter_max=30k_stage2/" \
                           --freeze_bn True \
                           --pretrained_ckpt_file "./resnet101_16_iamgenet_pre-True_ckpt_file-None_loss_weight_file-None_batch_size-8_base_size-513_crop_size-513_split-train_lr-0.007_iter_max-30000final.pth" \
                           --weight_decay 4e-5 \
                           --lr 0.001 \
                           --output_stride 8 \
                           --iter_max 30000 \
                           --batch_size_per_gpu 4

I only rewrote the part of the code associated with checkpoint_dir so that checkpoints load correctly.
So, what is wrong with my settings? I would appreciate any advice.
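
For reference, the checkpoint-loading change I mean is roughly the sketch below. It assumes the stage-1 script saves either a raw state_dict or a dict containing one; the helper name and the "module." prefix handling are my own, not the repository's code:

import torch

def load_stage1_checkpoint(model, ckpt_path, device="cuda"):
    # Load a stage-1 "final" checkpoint so stage 2 can fine-tune from it.
    ckpt = torch.load(ckpt_path, map_location=device)
    # Some scripts save {"state_dict": ...} together with optimizer state,
    # others save the bare state_dict; handle both.
    state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    # Strip the "module." prefix that nn.DataParallel adds to parameter names.
    state_dict = {k[len("module."):] if k.startswith("module.") else k: v
                  for k, v in state_dict.items()}
    model.load_state_dict(state_dict, strict=False)
    return model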

I found that the batch size needs to be 16. I will test batch size = 16 later.

I used batch size = 16 on 4 GPUs (4 per GPU), but could only achieve 73.82% mIoU. Have you reproduced the reported result?

May I ask what stage1 and stage2 refer to? I didn't see those phases in the paper or the implementation.

You should run train.py directly rather than voc2012_train.sh, which is just an example.

@gasvn

May I ask what stage1 and stage2 refer to? I didn't see those phases in the paper or the implementation.

DeepLabv3+ uses the same training protocol as DeepLabv3.
In the DeepLabv3 paper, it says:
Since large batch size is required to train batch normalization parameters, we employ output stride = 16 and compute the batch normalization statistics with a batch size of 16. The batch normalization parameters are trained with decay = 0.9997. After training on the trainaug set with 30K iterations and initial learning rate = 0.007, we then freeze batch normalization parameters, employ output stride = 8, and train on the official PASCAL VOC 2012 trainval set for another 30K iterations and smaller base learning rate = 0.001.
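
In other words, the protocol boils down to the two configurations sketched below, plus the usual poly learning-rate schedule. This is just my summary of the quoted paragraph in code form, not this repository's actual config:

# Two-stage DeepLabv3/v3+ training protocol, as described in the quoted paragraph.
STAGE1 = dict(split="trainaug", output_stride=16, batch_size=16,
              base_lr=0.007, iter_max=30000, freeze_bn=False)
STAGE2 = dict(split="trainval", output_stride=8, batch_size=16,
              base_lr=0.001, iter_max=30000, freeze_bn=True)

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" schedule used in the DeepLab papers:
    # lr = base_lr * (1 - iter / max_iter) ** power
    return base_lr * (1.0 - cur_iter / max_iter) ** power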

I found that the batch size needs to be 16. I will test batch size = 16 later.

When I use batch size = 6 or 8, I get an out-of-memory error, so I use batch size = 4. My devices are 4 GPUs (1080 Ti). How can I train with a large batch size? And if only a small batch size is used, sync-BN is not necessary; what do you think about this?
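
For what it's worth, the point of sync-BN is exactly the small per-GPU batch case: with 4 GPUs and 4 images each, the BN statistics are computed over the global batch of 16 instead of 4. A rough sketch with current PyTorch (the repository ships its own synchronized BN; torch.nn.SyncBatchNorm here is only to illustrate the idea, and it requires one process per GPU under DistributedDataParallel):

import torch
import torch.distributed as dist

def wrap_model_for_sync_bn(model, local_rank):
    # Assumes the process group was already initialized, e.g. under torchrun:
    #   dist.init_process_group(backend="nccl")
    # Replace every BatchNorm layer with SyncBatchNorm so statistics are
    # reduced across all processes (global batch = per-GPU batch * world size).
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    return model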

@futurebelongtoML Sorry to disturb you; I just want to know whether you have run test.py successfully. I run this code but keep getting this error:
ImportError: cannot import name 'cfg'

and I got nothing in the Results and Preweights directories. Do you know how to solve this? Hope for your reply!