ahmdtaha / simsiam

PyTorch implementation of Exploring Simple Siamese Representation Learning

Distributed training on Cifar10

ecoxial2007 opened this issue

Thanks for sharing the code!
I tried to train on CIFAR-10 and ran into a problem: the gpu_id and the process rank do not match here:

        self.sampler = torch.utils.data.distributed.DistributedSampler(
            trn_dataset, rank=cfg.gpu, num_replicas=cfg.world_size, shuffle=True
        )
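
To make the mismatch concrete, here is a standalone snippet (not the repo's code) showing what happens when the sampler rank is an absolute GPU id instead of a process rank in [0, world_size):

    # Standalone repro sketch (not the repo's code): passing an absolute GPU id
    # (e.g. 5) as the DistributedSampler rank while num_replicas=4.
    import torch
    from torch.utils.data import TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    trn_dataset = TensorDataset(torch.arange(100))
    sampler = DistributedSampler(trn_dataset, rank=5, num_replicas=4, shuffle=True)
    # Recent PyTorch versions reject this rank already at construction time;
    # older ones only fail once the sampler is iterated, which is what the
    # DataLoader does internally.
    list(iter(sampler))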

Hi @ecoxial2007

  1. Can you please share your world_size, base_gpu, and the number of GPUs inside your machine?
  2. Can you please try gloo instead of the current nccl setting here? (a minimal sketch of what I mean follows below)
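
Something like this, assuming the process group is initialized with torch.distributed.init_process_group somewhere in the repo (the argument names there may differ):

    # Hedged sketch, not the repo's exact code: switching the DDP backend
    # from nccl to gloo for debugging.
    import os
    import torch.distributed as dist

    def init_ddp(rank, world_size, backend="gloo"):  # try "gloo" instead of "nccl"
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "12355")
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

Switching to gloo mainly helps rule out nccl-specific communication issues.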

Thanks

Thank you for your reply. There are 10 GPUs inside my machine. I have tried:

  • world_size=4 base_gpu=0: training started successfully
  • world_size=4 base_gpu=2: I hit the following problem:
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "pretrain_main.py", line 78, in train_ddp
    trn_pretrain.trn(cfg, model)
  File "/home/data/302_self-supervised_learning/liangx/simsiam/trn_pretrain.py", line 106, in trn
    train(dataset.trn_loader, model,criterion, optimizer, epoch, cfg, writer=writer)
  File "/home/data/302_self-supervised_learning/liangx/simsiam/trainers/pretrain.py", line 37, in train
    for i , data in enumerate(train_loader):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 827, in __init__
    self._reset(loader, first_iter=True)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 857, in _reset
    self._try_put_index()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1091, in _try_put_index
    index = self._next_index()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 111, in __iter__
    assert len(indices) == self.num_samples
AssertionError
  • I modified rank=cfg.gpu to rank=cfg.gpu-cfg.gpu, as in cifar100, and the problem went away.

Thanks

Hi @ecoxial2007

Setting rank=cfg.gpu-cfg.gpu means that all 4 processes are training on a single GPU (GPU_0). You can verify that by checking nvidia-smi.

world_size=4 base_gpu=0 means the code is using the GPUs [0,1,2,3]
world_size=4 base_gpu=2 means the code is using the GPUs [2,3,4,5]

Since world_size=4 base_gpu=0 worked successfully while world_size=4 base_gpu=2 did not, I would make sure that all GPUs are working properly. I would try world_size=1 base_gpu=2, world_size=1 base_gpu=3,... world_size=1 base_gpu=5.
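
Independent of the repo's code, a quick standalone smoke test for the same GPU range could look like this (just a sanity check, not part of this repository):

    # Standalone GPU smoke test (not the repo's code): run a tiny matmul on each
    # GPU that world_size=4 base_gpu=2 would use and make sure none of them fails.
    import torch

    base_gpu, world_size = 2, 4
    for gpu in range(base_gpu, base_gpu + world_size):
        x = torch.randn(64, 64, device=f"cuda:{gpu}")
        y = x @ x
        torch.cuda.synchronize(gpu)
        print(f"GPU {gpu}: ok, sum={y.sum().item():.3f}")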

BTW, can you please try a larger range of ports? e.g. start_port = random.choice(range(12355,12455))
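
In case it helps, a hedged sketch of how such a random port would typically be wired into the rendezvous (the repo may do this differently):

    # Hedged sketch, not the repo's exact code: pick a random master port so a
    # port left busy by a previous crashed run cannot block the rendezvous.
    import os
    import random

    start_port = random.choice(range(12355, 12455))
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(start_port)
    # torch.distributed.init_process_group(...) then picks this port up via the
    # default env:// init_method.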

Thanks

Hi @ahmdtaha
Sorry, actually I modified rank=cfg.gpu to rank=cfg.gpu-cfg.base_gpu, and
setting world_size=4 base_gpu=2 now works successfully.
I guess world_size=4 base_gpu=2 uses GPUs [2,3,4,5], while the rank DistributedSampler expects is the process index (0..3) rather than the GPU id, but I'm not sure.
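
For completeness, a small standalone illustration (plain torch, not the repo's exact code) of why rank=cfg.gpu-cfg.base_gpu works:

    # Standalone illustration (not the repo's exact code): with world_size=4 and
    # base_gpu=2, rank = gpu - base_gpu stays in [0, 3] and each process gets a
    # disjoint quarter of the data.
    import torch
    from torch.utils.data import TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    trn_dataset = TensorDataset(torch.arange(16))
    world_size, base_gpu = 4, 2

    for gpu in range(base_gpu, base_gpu + world_size):
        sampler = DistributedSampler(
            trn_dataset, rank=gpu - base_gpu, num_replicas=world_size, shuffle=False
        )
        print(f"GPU {gpu} (sampler rank {gpu - base_gpu}): {list(iter(sampler))}")

With rank=cfg.gpu the sampler ranks would be 2..5, and ranks 4 and 5 fall outside [0, world_size), which is exactly where the AssertionError above comes from.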

Thanks.