yuweihao / MambaOut

MambaOut: Do We Really Need Mamba for Vision? (CVPR 2025)

Repository from GitHub: https://github.com/yuweihao/MambaOut

Train from scratch on my GPUs.

DoranLyong opened this issue

Thanks for sharing your awesome work.
Your work has been a good baseline for me :)

I have some questions about training from scratch to reproduce your results.

Unfortunately, I only have 4 RTX 4090 GPUs, so I adjusted GRAD_ACCUM_STEPS to keep ALL_BATCH_SIZE=4096.
However, I couldn't reproduce your results; mine are worse than yours (e.g., for the Kobe-size model).

So, following your previous work PoolFormer, I set ALL_BATCH_SIZE=1024, --lr 1e-3, --warmup-epochs 5, and kept DROP_PATH=0.025.

Q1) If I tune ALL_BATCH_SIZE, do I also need to change the value of DROP_PATH?
Q2) I wonder if there is a rule to set DROP_PATH depending on model sizes.
Q3) Could you share any tips for reproducing your results with a small number of GPUs?

Hi @DoranLyong ,

Thank you so much for your recognition of our work.

A1) In general, no need.
A2) I don't have a clear, specific rule for DROP_PATH. My simple rule is that the DROP_PATH of a larger model should be larger than or equal to that of a smaller model.
A3) Maybe try EMA (a rough sketch follows below), although I am not sure whether it works.
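
For A3, an untested sketch: assuming distributed_train.sh simply forwards its extra arguments to timm's train.py, EMA can be enabled with timm's --model-ema flag (the decay value below is only illustrative, not a tuned recommendation):

# Untested: append timm's EMA flags to the usual training command.
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model mambaout_kobe --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.025 \
--model-ema --model-ema-decay 0.9999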

Besides, could you please share the results of the MambaOut-Kobe you trained? I may try to reproduce it in a setting similar to yours to help find the solution.

Thank you so much once again.

Thanks for your concern :)

$\text{Batch Size per GPU} = \frac{\text{Total Batch Size}}{\text{Number of GPUs} \times \text{Gradient Accumulation Steps}}=\frac{4096}{8 \times 4}=128$
This is your setting. So I tried to match the same per-GPU batch size and keep the learning rate, as shown below.
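
A quick shell sanity check that both settings give the same per-GPU batch size:

# per-GPU batch size = ALL_BATCH_SIZE / NUM_GPU / GRAD_ACCUM_STEPS
echo $((4096 / 8 / 4))   # your setting: 128
echo $((4096 / 4 / 8))   # my setting:   128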

1st try

DATA_PATH=/workspace/dataset/ImageNet2012
CODE_PATH=/workspace/projects/MambaOut # modify code path here


ALL_BATCH_SIZE=4096
NUM_GPU=4
GRAD_ACCUM_STEPS=8 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS


MODEL=mambaout_kobe
DROP_PATH=0.025


cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH

Next, I tried to change the learning rate following the rule below.

$\text{New Learning Rate} = \text{Old Learning Rate} \times \frac{\text{New Batch Size per GPU}}{\text{Old Batch Size per GPU}} = 4 \times 10^{-3} \times \frac{64}{128} = 2 \times 10^{-3}$

So I doubled GRAD_ACCUM_STEPS and decreased the learning rate to 2e-3.
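
For reference, the same computation in shell (awk is used because bash arithmetic is integer-only):

# new_lr = old_lr * (new per-GPU batch size / old per-GPU batch size)
awk 'BEGIN { printf "new lr = %g\n", 4e-3 * 64 / 128 }'   # prints: new lr = 0.002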

2nd try

DATA_PATH=/workspace/dataset/ImageNet2012
CODE_PATH=/workspace/projects/MambaOut # modify code path here


ALL_BATCH_SIZE=4096
NUM_GPU=4
GRAD_ACCUM_STEPS=16 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS


MODEL=mambaout_kobe
DROP_PATH=0.025


cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 2e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH

I stopped training at epoch 249, and the best accuracy so far was 75.5.

In my experience training your MetaFormer baselines from scratch, the accuracy at this epoch should be higher than that in order to reach 80.0 within 300 epochs.

summary.csv

I'd like to share my training log. I have no idea why it saturated early :s

Hi @DoranLyong , I reproduced MambaOut-Kobe on 8 A5000 GPUs and it can achieve 80.1. See the log in
mambaout_kobe_bs4096_gpu8_ga2_lr4e-3_dp0025.csv.

My environment is:
Hardware: 8 × A5000 GPUs
PyTorch: 1.11.0
cuDNN: 8
timm: 0.6.11

My training command is

DATA_PATH=/local_home/dataset/imagenet
CODE_PATH=/home/yuweihao/code/MambaOut # modify code path here


ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=2 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS


MODEL=mambaout_kobe
DROP_PATH=0.025


cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH --native-amp

Therefore, you may try:

  • Set up the environment with the same package versions as mine.
  • Set the batch size to 4096 (I did not try other batch sizes, so I am not sure whether they work).
  • You can also use --native-amp and reduce GRAD_ACCUM_STEPS to accelerate training (an example 4-GPU command is sketched below).
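
For example, an untested 4-GPU variant along these lines, keeping the total batch size at 4096 (whether a per-GPU batch size of 128 fits in 4090 memory with AMP is an assumption):

DATA_PATH=/workspace/dataset/ImageNet2012
CODE_PATH=/workspace/projects/MambaOut # modify code path here


ALL_BATCH_SIZE=4096
NUM_GPU=4
GRAD_ACCUM_STEPS=8 # keeps per-GPU batch size at 4096/4/8 = 128
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS


MODEL=mambaout_kobe
DROP_PATH=0.025


cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path $DROP_PATH --native-amp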

Okay, I solved my issue :)

  • It was a version issue with PyTorch and timm.
  • After matching my environment to yours (roughly as sketched below), I got the same results as you.
  • For PyTorch and timm, it seems there are some minor differences between the latest versions and the earlier ones.
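
Roughly, matching the versions looked something like this (the exact CUDA build of torch depends on your machine):

# Pin the package versions from the working environment above;
# choose the torch wheel that matches your CUDA version.
pip install torch==1.11.0 timm==0.6.11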