dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

A very large batch size requires 64 GPUs

Jxu-Thu opened this issue

Thanks for your great code!
In your paper, running the pre-training experiments requires 64 V100 GPUs.
For research purposes, that is too heavy.

If I use a smaller batch size, will the performance drop? By how much? Can you provide any empirical results?

Unfortunately, we only experimented with batch_size=4096, so we have no empirical results.
That said, I believe the performance will be largely preserved at lower batch sizes such as 2048 or 1024.

For low-resource regimes, the published code provides a "gradient accumulation" option.
It automatically computes the number of gradient-accumulation steps from the given per_gpu_batchsize and the number of GPUs. (see https://github.com/dandelin/ViLT/blob/master/run.py#L42-L44)
Theoretically, gradient accumulation produces the same result as the non-accumulated version. (However, we did not use gradient accumulation in our experiments, so this is not guaranteed.)
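(For reference, the computation at those lines amounts to something like the sketch below; the variable names follow the config shown later in this thread, and the concrete numbers are just an example, not the paper's setup.)

# Desired effective batch size vs. what actually fits on each GPU.
batch_size = 4096
per_gpu_batchsize = 32
num_gpus, num_nodes = 8, 1

# Number of micro-batches to accumulate before each optimizer step.
grad_steps = batch_size // (per_gpu_batchsize * num_gpus * num_nodes)
print(grad_steps)  # -> 16 for this example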

If I use a smaller setup such as num_gpus=8 num_nodes=1 (batch size 4096, with accum_steps=8), should I modify any other configuration, such as max_steps?

@Jxu-Thu
As far as I know, PyTorch Lightning increments the LightningModule's internal step only once accumulation is complete.
(https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L813)
So you do not need to change any other configuration to use the gradient accumulation feature.
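(A minimal sketch of what this means for the trainer setup, not the repository's exact code: with accumulate_grad_batches, several micro-batches feed one optimizer step, and the global step advances only per optimizer step, so max_steps keeps its meaning.)

from pytorch_lightning import Trainer

# With accumulate_grad_batches=8, eight micro-batches contribute to one optimizer
# step, and Lightning's global_step increases by 1 only after that optimizer step.
# max_steps is therefore still counted in optimizer steps, independent of accumulation.
trainer = Trainer(
    accumulate_grad_batches=8,  # matching the accum_steps=8 example in the question above
    max_steps=100000,           # unchanged; counted in optimizer steps
)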

Many thanks for your kind reply!
I am trying to reproduce the results with 24 V100 GPUs, 3 accumulation steps, and an effective batch size over 4k, without modifying any other configuration.

@Jxu-Thu
Also, please pull the latest commit (#12 (comment))

Thanks for your reminder

I found that training is very slow because of the huge number of iterations in each epoch, so I tried to inspect why there are so many iterations with a small batch size.
Given vg+mscoco+gcc+sbu (about 9M samples) with batch size 32, I obtain this:
Epoch 0: 0%| | 0/2392933 [00:00<?, ?it/s]
Given vg+mscoco (about 5M samples) with batch size 32, I obtain this:
Epoch 0: 0%| | 0/169000 [00:00<?, ?it/s]

Why does adding gcc+sbu (only ~4M samples) increase the iterations from ~169k to ~2.39M?
For vg+mscoco, 32 × 169k ≈ 5.4M, which roughly matches the dataset size (~5M samples).
However, for vg+mscoco+gcc+sbu, 32 × 2.39M ≈ 76.5M, far more than the ~9M samples. I cannot understand why there are so many iterations.
I have checked the code carefully but could not find any clues. Could you help me?
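(For context, a rough expectation, treating one iteration as one per-GPU batch of 32 and using the approximate sample counts quoted above; this is only a sanity-check sketch.)

# Expected iterations per epoch at a per-GPU batch size of 32.
samples_coco_vg = 5_000_000   # ~5M samples quoted above for vg+mscoco
samples_all_four = 9_000_000  # ~9M samples quoted above for vg+mscoco+gcc+sbu
batch = 32

print(samples_coco_vg // batch)   # ~156k, roughly in line with the observed ~169k
print(samples_all_four // batch)  # ~281k, nowhere near the observed ~2.39M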

@Jxu-Thu could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)
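(With this repository's run.py, the invocation should look roughly like the line below; the data_root value is just a placeholder.)

python run.py print_config with data_root=<your_data_root> task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32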

vg+mscoco+gcc+sbu

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg', 'sbu', 'gcc']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
irtr = 0
itm = 1
mlm = 1
mpp = 0
nlvr2 = 0
vqa = 0
INFO - ViLT - Completed after 0:00:00


coco+vg

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
irtr = 0
itm = 1
mlm = 1
mpp = 0
nlvr2 = 0
vqa = 0
INFO - ViLT - Completed after 0:00:00

@Jxu-Thu Thank you.
I'll investigate this issue soon.

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
=> Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
=> Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Please double-check the integrity of your arrow files.
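(One way to spot this, sketched below; the glob pattern is an assumption based on the data_root in the configs above, and the check only verifies that each .arrow file opens cleanly and reports its row count.)

import glob
import pyarrow as pa

# Open every arrow file and print its row count; a corrupted file will fail to
# open, and an unexpectedly large count hints at duplicated shards.
for path in sorted(glob.glob("data/VilT_dataset/*.arrow")):
    with pa.memory_map(path, "r") as source:
        table = pa.ipc.open_file(source).read_all()
    print(path, table.num_rows)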

Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.

Hi,
I am facing an issue where, on increasing the number of GPUs and nodes, the number of steps does not change. For example, if I run
python run.py with data_root=/mnt/nfs/dandelin num_gpus=4 num_nodes=8 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'

the number of steps is still nearly 169158, while I believe it should be reduced to 169k/(4*8). I also observe that the time taken per epoch with just 1 GPU is less than with 32 GPUs.
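(That expectation in numbers, as a sketch; whether the progress bar actually reflects it depends on how the data is sharded across processes, e.g. via a DistributedSampler, which is not settled in this thread.)

steps_single_gpu = 169158              # per-epoch steps observed with 1 GPU, per-GPU batch 32
world_size = 4 * 8                     # num_gpus * num_nodes
print(steps_single_gpu // world_size)  # ~5286 steps per epoch expected per process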

Has anyone faced these issues before?

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 => Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]' => Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I guess you have some duplicated/corrupted arrow files for SBU or GCC dataset. Please double-check your arrow files' sanity.

What is the total batch size for this run?