Training a DeepLabv3 model with output_stride=16, crop_size=513, and batch_size=8 on two 2080Ti GPUs
songkq opened this issue
Hi, I only have two 2080Ti GPUs with 11 GB of memory each. I'd like to train the baseline DeepLabv3 with ResNet-101 as the backbone and batch_size=8 per GPU (for 2 GPUs, global batch_size=16):
input the gpu (seperate by comma (,) ): 0,1
using gpus 0,1
0 -- deeplabv3
1 -- deeplabv3+
2 -- pspnet
choose the base network: 0
0 -- resnet_v1_50
1 -- resnet_v1_101
2 -- resnet_v1_152
choose the base network: 1
The backbone is resnet101
The base model is deeplabv3
0 -- softmax cross entropy loss.
1 -- sigmoid binary cross entropy loss.
2 -- bce and RMI loss.
3 -- Affinity field loss.
5 -- Pyramid loss.
input the loss type of the first stage: 2
0 -- PASCAL VOC2012 dataset
1 -- Cityscapes
2 -- CamVid
input the dataset: 0
input the batch_size (4, 8, 12 or 16): 8
The data dir is /workspace/data/PASCAL_VOC2012/VOCdevkit/VOC2012, the batch size is 8.
make the directory /workspace/pyroom/RMISegLoss/rmi_model/rmi_re_pascal_r3_pw1_st4_si4_bp513-8_net0-1-0.5_n
Namespace(accumulation_steps=1, backbone='resnet101', base_size=513, batch_size=8, bn_mom=0.05, checkname='deeplab-resnet', crf_iter_steps=1, crop_size=513, cuda=True, data_dir='/workspace/data/PASCAL_VOC2012/VOCdevkit/VOC2012', dataset='pascal', dist_backend='nccl', distributed=True, epochs=23, eval_interval=2, freeze_bn=False, ft=False, gpu_ids=[0, 1], init_global_step=0, init_lr=0.007, local_rank=0, loss_type=2, loss_weight_lambda=0.5, lr_multiplier=10.0, lr_scheduler='poly', main_gpu=0, max_ckpt_nums=15, model_dir='/workspace/pyroom/RMISegLoss/rmi_model/rmi_re_pascal_r3_pw1_st4_si4_bp513-8_net0-1-0.5_n', momentum=0.9, multiprocessing_distributed=False, nesterov=False, no_cuda=False, no_val=False, out_stride=16, output_dir='/home/zhaoshuai/models/deeplabv3_cbl_2/', proc_name='rmi_model/rmi_re_pascal_r3_pw1_st4_si4_bp513-8_net0-1-0.5_n', resume='None', rmi_pool_size=4, rmi_pool_stride=4, rmi_pool_way=1, rmi_radius=3, save_ckpt_steps=500, seed=1, seg_model='deeplabv3', slow_start_lr=0.0001, slow_start_steps=1500, start_epoch=0, sync_bn=True, test_batch_size=8, train_split='trainaug', use_balanced_weights=False, use_sbd=False, weight_decay=0.0001, workers=8, world_size=2)
INFO:PyTorch: Using PASCAL VOC dataset, the training batch size 8 and crop size is 513.
Number of image_lists in trainaug: 10582
Number of image_lists in val: 1449
Restore parameters from the /root/.encoding/models/resnet101-2a57e44d.pth
INFO:PyTorch: Using Region Mutual Information Loss.
INFO:PyTorch: The batch norm layer is Hang Zhang's <class 'model.sync_bn.syncbn.BatchNorm2d'>
INFO:PyTorch: Using poly learning rate scheduler!
INFO:PyTorch: Starting Epoch: 0
INFO:PyTorch: Total Epoches: 23
Is this equivalent to training a DeepLabv3 model with output_stride=16, crop_size=513, and batch_size=16 on a single TITAN RTX GPU? Will it achieve similar convergence in 23 epochs?
Does the batch_size matter? If so, how should I adjust the other hyperparameters, such as epochs, lr, and the lr_scheduler, for batch_size=8?
The batch_size does matter.
If you want to train the baseline with a batch size of 16, just set batch_size=16; the program will divide it by the number of GPUs (2 in your setting). This is done by
Line 133 in e3ada00
Because we use SynchronizedBatchNorm2d, the results should be the same as with batch_size=16 on a single TITAN RTX GPU.
In short, batch_size should be the global batch size. epochs and lr will adjust automatically with batch_size.
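To make the global versus per-GPU distinction concrete, here is a minimal sketch of the split described above. The function name `per_gpu_batch_size` is hypothetical and is not taken from the repo; it only assumes that the value entered at the prompt is divided evenly by the number of GPUs.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Return how many samples each GPU processes per step.

    Hypothetical helper: assumes the entered batch size is the global
    batch size and is split evenly across the GPUs.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    return global_batch_size // num_gpus


# Entering 8 at the prompt with 2 GPUs gives 4 samples per GPU;
# entering 16 gives the intended 8 samples per GPU.
for entered in (8, 16):
    print(entered, "->", per_gpu_batch_size(entered, 2))
```

Under this assumption, the run shown in the log (batch_size=8 on GPUs 0,1) corresponds to a global batch of 8, not 16.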
Hi @mzhaoshuai, many thanks.
When torch.nn.DataParallel is used for multi-GPU training, the loss is computed on the main GPU (gpu-0), which makes memory usage unbalanced across GPUs. Because of this, I couldn't train the baseline with global batch_size=16 on two 2080Ti GPUs (CUDA out of memory).
I wonder whether distributed training on two GPUs using torch.nn.parallel.DistributedDataParallel and model = nn.SyncBatchNorm.convert_sync_batchnorm(model), with batch_size=8 per GPU, can achieve results similar to those you reported (trained with torch.nn.DataParallel and model.sync_bn.syncbn.BatchNorm2d)?
In fact, the loss is calculated on different GPUs in our setting:
Line 69 in e3ada00
Line 206 in e3ada00
However, as you said, memory usage is still unbalanced across GPUs.
In theory, distributed training on two GPUs can definitely achieve similar results. (Sadly, the code in the repo does not support distributed training. You may check the pytorch/examples repo for help.)
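For reference, a minimal sketch of the DistributedDataParallel plus SyncBatchNorm setup asked about above could look like the following. It is not the repo's code: `build_toy_model` and `train_one_epoch` are hypothetical names, the tiny network and random tensors stand in for DeepLabv3 and PASCAL VOC, and the script is assumed to be launched with `torchrun --nproc_per_node=2 <script>.py`, which sets `LOCAL_RANK` for each process.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def build_toy_model() -> nn.Sequential:
    # Hypothetical stand-in for the segmentation network.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(inplace=True),
        nn.Conv2d(8, 21, kernel_size=1),  # 21 classes, as in PASCAL VOC
    )


def train_one_epoch(per_gpu_batch_size: int = 8) -> None:
    # One process per GPU; torchrun provides LOCAL_RANK, RANK and WORLD_SIZE.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Convert BatchNorm layers so statistics are synchronized across processes.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(build_toy_model())
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # Random data in place of the real dataset (assumption for illustration).
    dataset = TensorDataset(
        torch.randn(32, 3, 65, 65), torch.randint(0, 21, (32, 65, 65))
    )
    sampler = DistributedSampler(dataset)  # each process sees its own shard
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.007, momentum=0.9, weight_decay=1e-4
    )

    sampler.set_epoch(0)  # reshuffle consistently across processes
    for images, labels in loader:
        images = images.cuda(local_rank, non_blocking=True)
        labels = labels.cuda(local_rank, non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # computed on this process's GPU
        loss.backward()  # DDP averages gradients across processes
        optimizer.step()

    dist.destroy_process_group()


# Only start training when launched through torchrun.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    train_one_epoch()
```

Because every process computes its own loss on its own GPU, memory usage tends to be more even than with torch.nn.DataParallel. Here batch_size is per process, so two processes with 8 samples each give a global batch of 16.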
Finally, I have trained the baseline on 2 GPUs with 11 GB of memory each (GTX 1080 Ti), and there was no OOM, which is strange.
You can also try accumulating gradients. It is slow but saves a lot of GPU memory. (Note: it affects the statistics of the BN layers.)
https://discuss.pytorch.org/t/accumulating-gradients/30020
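A minimal sketch of gradient accumulation, in the spirit of the linked discussion, is shown below. The function `train_with_accumulation`, the toy model, and the random batches are all assumptions for illustration. The argument dump above lists accumulation_steps=1, but how the repo uses that option is not shown here.

```python
import torch
import torch.nn as nn


def train_with_accumulation(model, batches, optimizer, criterion, accumulation_steps=2):
    """Step the optimizer once every `accumulation_steps` micro-batches.

    Returns the number of optimizer updates performed.
    """
    model.train()
    optimizer.zero_grad()
    num_updates = 0
    for step, (images, labels) in enumerate(batches, start=1):
        loss = criterion(model(images), labels)
        # Scale the loss so the accumulated gradient matches the average
        # over the larger effective batch.
        (loss / accumulation_steps).backward()
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            num_updates += 1
    return num_updates


torch.manual_seed(0)
# Toy network and random data standing in for the real model and dataset.
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.BatchNorm2d(4),
    nn.ReLU(),
    nn.Conv2d(4, 21, kernel_size=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.007, momentum=0.9)
criterion = nn.CrossEntropyLoss()
batches = [
    (torch.randn(8, 3, 33, 33), torch.randint(0, 21, (8, 33, 33)))
    for _ in range(4)
]

# Four micro-batches of 8 with accumulation_steps=2 act like two updates
# with an effective batch size of 16.
num_updates = train_with_accumulation(model, batches, optimizer, criterion, 2)
print(num_updates)
```

The gradients behave as if the batch size were 16, but each forward pass still normalizes over only 8 samples, which is why the BN statistics differ from a true batch of 16.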