Training on multiple nodes
d-li14 opened this issue
Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected over Ethernet, without Slurm management, how can I run the code with the 16-V100 configuration you describe?
I am not sure how to run it across Ethernet-connected nodes.
I think you can directly run the training on a single machine (with 8 GPUs) with a smaller batch size, e.g.
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 1024 --learning-rate 0.25 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
or
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 512 --learning-rate 0.3 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
It should give similar results.
@jingli9111 Thanks very much for your reply. However, my concern is that training on a single machine might be too slow.
@d-li14 The provided main.py script internally uses multiprocessing. To use two nodes without Slurm, the best approach is to get rid of the multiprocessing.spawn call in main and merge the main function with main_worker. I have attached the kind of main function I use at the bottom for your reference.
Once you have that code structure, use multiproc.py from NVIDIA to execute the code (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/multiproc.py). An alternative is the internal PyTorch launcher, i.e. python -m torch.distributed.launch, but I always prefer multiproc.py because it redirects the output of the other processes to separate files, avoiding the amalgamation of logs from different processes (otherwise each process prints the exact same output, so every print statement appears world_size times). Furthermore, multiproc.py handles interrupts across the different processes better.
It takes an --nnodes parameter, which should be set to 2, as well as --nproc_per_node, which should be set to 8. Since the two nodes need a way to communicate, designate one node as the master and note its IP address. Then pass that IP address via --master_addr ... on both nodes. The final command will look something like: python ./multiproc.py --nnodes 2 --nproc_per_node 8 --master_addr ... main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
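For reference, here is a rough sketch of what the two per-node commands could look like with the built-in launcher (multiproc.py accepts the same --nnodes/--node_rank/--nproc_per_node/--master_addr style of arguments). The IP 192.0.2.1 and port 29500 are placeholders, --node_rank must be 0 on the master and 1 on the other machine, and --use_env is added so the launcher exports LOCAL_RANK as an environment variable, which the main function below reads:
# on the master node (192.0.2.1 is a placeholder for its IP)
python -m torch.distributed.launch --use_env --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr 192.0.2.1 --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
# on the second node: the identical command, except --node_rank 1
python -m torch.distributed.launch --use_env --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr 192.0.2.1 --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024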
# Merged main() referenced above. Imports added for completeness; `parser` is the
# argparse parser already defined at the top of the repo's main.py.
import os
import sys

import torch

def main():
    args = parser.parse_args()
    args.ngpus_per_node = torch.cuda.device_count()

    # Initialize the distributed environment
    args.gpu = 0
    args.world_size = 1
    args.local_rank = 0
    args.distributed = int(os.getenv('WORLD_SIZE', 1)) > 1
    args.rank = int(os.getenv('RANK', 0))
    if "SLURM_NNODES" in os.environ:
        # Under Slurm, derive the local rank from the global rank.
        args.local_rank = args.rank % torch.cuda.device_count()
        print(f"SLURM tasks/nodes: {os.getenv('SLURM_NTASKS', 1)}/{os.getenv('SLURM_NNODES', 1)}")
    elif "WORLD_SIZE" in os.environ:
        # Under torch.distributed.launch / multiproc.py, the launcher provides LOCAL_RANK.
        args.local_rank = int(os.getenv('LOCAL_RANK', 0))

    args.gpu = args.local_rank
    torch.cuda.set_device(args.gpu)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    args.world_size = torch.distributed.get_world_size()
    assert int(os.getenv('WORLD_SIZE', 1)) == args.world_size
    print(f"Initializing the environment with {args.world_size} processes | Current process rank: {args.local_rank}")

    if args.rank == 0:
        # Only the first process creates the checkpoint directory and stats file.
        args.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        stats_file = open(args.checkpoint_dir / 'stats.txt', 'a', buffering=1)
        print(' '.join(sys.argv))
        print(' '.join(sys.argv), file=stats_file)

    gpu = args.gpu
    torch.backends.cudnn.benchmark = True
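From this point the rest of the original main_worker (model construction, optimizer, data loading, training loop) continues inside main() unchanged. A minimal sketch of the distributed-specific parts, assuming the BarlowTwins and Transform classes and the data/workers/batch-size arguments from the repo's main.py (everything else is illustrative, not the repo's exact code):
    # Continuation sketch: build the model, wrap it in DDP on this process's GPU,
    # and shard the dataset with a DistributedSampler so each process sees its own slice.
    import torchvision  # for ImageFolder; imported at the top of main.py in the repo

    model = BarlowTwins(args).cuda(args.gpu)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)   # sync BN statistics across all GPUs
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    dataset = torchvision.datasets.ImageFolder(args.data / 'train', Transform())
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    assert args.batch_size % args.world_size == 0
    per_device_batch_size = args.batch_size // args.world_size    # the global batch is split over all GPUs
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=per_device_batch_size, num_workers=args.workers,
        pin_memory=True, sampler=sampler)

    for epoch in range(args.epochs):
        sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
        # ... forward/backward and optimizer steps exactly as in the original main_worker ...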
@shoaibahmed Thanks a lot for your reply and detailed instructions! I will try it.
Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.
Me too. The largest batch size I was able to fit on 8 V100s (16 GB each) was 512, which used a little over 11 GB of memory per GPU.