Train from scratch on my GPUs.
DoranLyong opened this issue
Thanks for sharing your awesome work.
Your work has been a good baseline for me :)
I have some questions about training from scratch to reproduce your results.
Unfortunately, I only have 4 RTX 4090 GPUs, so I adjusted GRAD_ACCUM_STEPS to keep ALL_BATCH_SIZE=4096.
However, I couldn't get the same results as you; my results are worse than yours (e.g., for the kobe-size model).
So, following your previous work PoolFormer, I set ALL_BATCH_SIZE=1024, --lr 1e-3, and --warmup-epochs 5, and kept DROP_PATH=0.025.
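(If I understand the recipe correctly, this assumes the learning rate scales linearly with the total batch size: starting from lr = 4e-3 at ALL_BATCH_SIZE=4096, this gives lr = 4e-3 × 1024 / 4096 = 1e-3 at ALL_BATCH_SIZE=1024.)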
Q1) If I tune ALL_BATCH_SIZE, do I also need to change the value of DROP_PATH?
Q2) Is there a rule for setting DROP_PATH depending on model size?
Q3) Do you have any tips for reproducing your results with a small number of GPUs?
Hi @DoranLyong ,
Thank you so much for your recognition of our work.
A1) In general, no need.
A2) I don't have a clear and specific rule for DROP_PATH. My simple rule is that the DROP_PATH of a larger model should be larger than or equal to that of a smaller model.
A3) Maybe try EMA, although I am not sure whether it works.
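For example, since the training script follows timm, you could try appending the usual timm EMA flags to the training command; I am assuming they are exposed by this script, so treat this as a sketch rather than a verified command:
--model-ema --model-ema-decay 0.9999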
Besides, could you please share the results of the MambaOut-Kobe you trained? I may try to reproduce it in a setting similar to yours to help find the solution.
Thank you so much once again.
Thanks for your concern :)
This is your setting, so I tried to match the same per-GPU batch size and keep the learning rate, as below.
1st try
DATA_PATH=/workspace/dataset/ImageNet2012
CODE_PATH=/workspace/projects/MambaOut # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=4
GRAD_ACCUM_STEPS=8 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
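# i.e., 4096 / 4 / 8 = 128 images per GPU per forward/backward pass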
MODEL=mambaout_kobe
DROP_PATH=0.025
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH
Next, I tried changing the learning rate following the batch-size scaling rule: I doubled GRAD_ACCUM_STEPS and decreased the learning rate to 2e-3.
2nd try
DATA_PATH=/workspace/dataset/ImageNet2012
CODE_PATH=/workspace/projects/MambaOut # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=4
GRAD_ACCUM_STEPS=16 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
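# i.e., 4096 / 4 / 16 = 64 images per GPU per forward/backward pass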
MODEL=mambaout_kobe
DROP_PATH=0.025
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 2e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH
I stopped training at Epoch 249, and the best accuracy was 75.5.
In my experience training your MetaFormer baseline from scratch, the accuracy at this epoch should be higher than that in order to reach 80.0 within Epoch 300.
I'd like to share my training log. I have no idea why it saturated early :s
Hi @DoranLyong, I reproduced MambaOut-Kobe on 8 A5000 GPUs and it achieves 80.1. See the log:
mambaout_kobe_bs4096_gpu8_ga2_lr4e-3_dp0025.csv
My environment is:
Hardware: 8 A5000 GPUs
PyTorch: 1.11.0
CUDNN: 8
timm: 0.6.11
My training command is
DATA_PATH=/local_home/dataset/imagenet
CODE_PATH=/home/yuweihao/code/MambaOut # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=2 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
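# i.e., 4096 / 8 / 2 = 256 images per GPU per forward/backward pass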
MODEL=mambaout_kobe
DROP_PATH=0.025
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH --native-amp
Therefore, you may try:
- Set up the environment with the same package versions as mine (a minimal example follows this list).
- Set the batch size to 4096 (I did not try other batch sizes, so I am not sure whether they work).
- You can also use --native-amp and reduce GRAD_ACCUM_STEPS to accelerate training.
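For instance, the versions could be pinned like this (the appropriate CUDA build of PyTorch for your machine may differ, so treat this only as a sketch):
pip install torch==1.11.0 timm==0.6.11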
Okay, I solved my issue :)
- It was a version issue with PyTorch and timm.
- After matching my environment to yours, I could get the same results as you.
- It seems there are some minor differences between the latest versions of PyTorch and timm and the older ones.