arthurdouillard / CVPR2021_PLOP

Official code of CVPR 2021's PLOP: Learning without Forgetting for Continual Semantic Segmentation

Home Page: https://arxiv.org/abs/2011.11390

Reproduce 15-1 setup on Pascal VOC

mostafaelaraby opened this issue

Describe the bug
I tried to run the provided Pascal VOC script with Apex optimization level O1, keeping everything the same as in the script except that I was using a single GPU and therefore changed the batch size to 24.
But I got the following results:

              1-15   16-20  all
Paper         65.12  21.11  54.64
Code results  58.73  21.6   49.7

To Reproduce
start=`date +%s`

START_DATE=$(date '+%Y-%m-%d')

PORT=$((9000 + RANDOM % 1000))
GPU=0
NB_GPU=1
DATA_ROOT=./data
DATASET=voc
TASK=15-5s
NAME=PLOP
METHOD=PLOP
BATCH_SIZE=24
INITIAL_EPOCHS=30
EPOCHS=30
OPTIONS="--checkpoint checkpoints/step/"

RESULTSFILE=results/${START_DATE}_${DATASET}_${TASK}_${NAME}.csv
rm -f ${RESULTSFILE}

CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 0 --lr 0.01 --epochs ${INITIAL_EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
for step in 1 2 3 4 5
do
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step ${step} --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
done
python3 average_csv.py ${RESULTSFILE}

Hey, I don't have a GPU large enough to try a batch size of 24 on a single GPU so I cannot test your command.

However, multiple people have been able to reproduce 15-1 when using 2 GPUs (#3). Can you try that?
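
For reference, here is a minimal sketch of the variables that would change for a 2-GPU launch, keeping the rest of the reproduction script above identical. The per-GPU batch size of 12 is an assumption (2 GPUs x 12 images keeps the effective batch at 24); check the official scripts in the repository for the exact values:

# Hypothetical 2-GPU variant of the launch variables used above.
# BATCH_SIZE=12 is an assumption (per-GPU batch, 2 x 12 = 24 in total).
GPU=0,1
NB_GPU=2
BATCH_SIZE=12

CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 0 --lr 0.01 --epochs ${INITIAL_EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}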

After using 2 GPUs, I managed to reproduce the paper's results:

                     1-15   16-20  all
Paper                65.12  21.11  54.64
Code results 1 GPU   58.73  21.6   49.7
Code results 2 GPUs  65.2   20.9   54.7

What I have noticed is that the 1-GPU and 2-GPU runs give nearly the same results, with only small differences, until the last task. On the last task, the mIoU of the old classes drops from 65% (at the previous task) to 58% with 1 GPU, whereas with 2 GPUs it only drops from 68% to 65%.

@arthurdouillard I was wondering, though: why can the results be reproduced only with 2 GPUs and mixed precision?

I think the problem comes from either:

  • how gradients are accumulated with multiple GPUs vs. a single GPU; maybe you need to tune the learning rate for a single GPU
  • does the syncBN work differently depending on the number of GPUs?
  • Inplace ABN sync is supposed to work in the same way with different numbers of GPUs
  • For the learning rate, I will try to tune it and see if that helps replicate the paper's results (a sketch of such a single-GPU retry follows this list)
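
As a concrete sketch of that retry, this is what a single-GPU rerun of the incremental steps with a tuned learning rate could look like, reusing the variables from the reproduction script above. The value 0.0005 is only an assumed starting point, not a value recommended by the authors:

# Hypothetical single-GPU rerun of steps 1-5 with a tuned learning rate.
# TUNED_LR=0.0005 is an assumption; step 0 is kept as previously trained.
TUNED_LR=0.0005
for step in 1 2 3 4 5
do
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=1 run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size 24 --dataset ${DATASET} --name ${NAME} --task ${TASK} --step ${step} --lr ${TUNED_LR} --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
done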

Thanks @arthurdouillard

Don't hesitate to reopen this issue if you have new findings. Best,