Distributed training on Cifar10
ecoxial2007 opened this issue
Thanks for sharing the code!
I tried to train on CIFAR10 and found a problem: the `gpu_id` and the process rank do not match:
```python
self.sampler = torch.utils.data.distributed.DistributedSampler(
    trn_dataset, rank=cfg.gpu, num_replicas=cfg.world_size, shuffle=True
)
```
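For context, a minimal standalone sketch (not this repo's code) of what happens when `rank` receives a raw GPU id outside `[0, num_replicas)`:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))

# DistributedSampler expects rank to be the process rank in [0, num_replicas).
# Passing a GPU id such as 5 with num_replicas=4 breaks the per-replica slicing:
# older PyTorch versions fail an internal length assertion while iterating,
# newer versions raise a ValueError already at construction.
sampler = DistributedSampler(dataset, num_replicas=4, rank=5, shuffle=True)
list(iter(sampler))
```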
Hi @ecoxial2007
- Can you please share your `world_size`, `base_gpu`, and the number of GPUs inside your machine?
- Can you please try `gloo` instead of the current `nccl` setting here?
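For example, switching the backend would look roughly like this (just a sketch, the function name is made up and the actual setup call in this repo may differ):

```python
import torch.distributed as dist

def init_distributed(rank: int, world_size: int, port: int = 12355) -> None:
    # Same init arguments as before; only the backend string changes from "nccl" to "gloo".
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        world_size=world_size,
        rank=rank,
    )
```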
Thanks
Thank you for your reply. The number of GPUs inside my machine is 10. I have tried:

- `world_size=4` `base_gpu=0` and successfully started training
- but `world_size=4` `base_gpu=2` runs into problems:

```
Process Process-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "pretrain_main.py", line 78, in train_ddp
trn_pretrain.trn(cfg, model)
File "/home/data/302_self-supervised_learning/liangx/simsiam/trn_pretrain.py", line 106, in trn
train(dataset.trn_loader, model,criterion, optimizer, epoch, cfg, writer=writer)
File "/home/data/302_self-supervised_learning/liangx/simsiam/trainers/pretrain.py", line 37, in train
for i , data in enumerate(train_loader):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
return self._get_iterator()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 827, in __init__
self._reset(loader, first_iter=True)
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 857, in _reset
self._try_put_index()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1091, in _try_put_index
index = self._next_index()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 111, in __iter__
assert len(indices) == self.num_samples
AssertionError
```
- I modified `rank=cfg.gpu` to `rank=cfg.gpu-cfg.gpu`, as in `cifar100`, and there is no problem.
Thanks
Hi @ecoxial2007
Setting `rank=cfg.gpu-cfg.gpu` means that all 4 processes are training on a single GPU (GPU_0). You can verify that by checking `nvidia-smi`.
- `world_size=4 base_gpu=0` means the code is using the GPUs [0,1,2,3]
- `world_size=4 base_gpu=2` means the code is using the GPUs [2,3,4,5]
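Concretely, the mapping is just this (a toy sketch, not code from the repo):

```python
def gpus_in_use(base_gpu: int, world_size: int) -> list:
    # The processes occupy world_size consecutive GPUs starting at base_gpu.
    return list(range(base_gpu, base_gpu + world_size))

print(gpus_in_use(base_gpu=0, world_size=4))  # [0, 1, 2, 3]
print(gpus_in_use(base_gpu=2, world_size=4))  # [2, 3, 4, 5]
```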
Since `world_size=4 base_gpu=0` worked successfully while `world_size=4 base_gpu=2` did not, I would make sure that all GPUs are working properly. I would try `world_size=1 base_gpu=2`, `world_size=1 base_gpu=3`, ..., `world_size=1 base_gpu=5`.
BTW, can you please try a larger range of ports? e.g. `start_port = random.choice(range(12355,12455))`
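That is, something along these lines where the rendezvous address is built (a sketch; the variable names are assumptions):

```python
import random

# Draw the port from a wider window so concurrent jobs on the same machine
# are less likely to collide on the rendezvous address.
start_port = random.choice(range(12355, 12455))
init_method = f"tcp://127.0.0.1:{start_port}"
```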
Thanks
Hi @ahmdtaha
Sorry, actually I modified `rank=cfg.gpu` to `rank=cfg.gpu-cfg.base_gpu`, and setting `world_size=4 base_gpu=2` worked successfully.
I guess `world_size=4 base_gpu=2` uses GPUs [2,3,4,5], and the process id expected by `DistributedSampler` is different, but I'm not sure.
Thanks.
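For anyone else hitting the same assertion, the sampler construction that worked for me looks roughly like this (a sketch mirroring the snippet above):

```python
self.sampler = torch.utils.data.distributed.DistributedSampler(
    trn_dataset,
    # rank must be the process index in [0, world_size), not the raw GPU id,
    # so subtract base_gpu when the first GPU is not GPU 0.
    rank=cfg.gpu - cfg.base_gpu,   # was: rank=cfg.gpu
    num_replicas=cfg.world_size,
    shuffle=True,
)
```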