Training on multiple nodes
d-li14 opened this issue
Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected over Ethernet, without Slurm management, how can I run the code with the 16-V100 configuration you describe?
I am not sure how to run it across Ethernet-connected nodes.
I think you can directly run the training on a single machine (with 8 GPUs) with a smaller batch size, e.g.
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 1024 --learning-rate 0.25 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
or
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 512 --learning-rate 0.3 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
It should give similar results.
@jingli9111 Thanks very much for your reply. However, my concern is that training on a single machine might be too slow.
@d-li14 The provided main.py script internally uses multiprocessing. To use two nodes without Slurm, the best approach is to get rid of the multiprocessing.spawn call in main and merge the main function with main_worker. I have attached the kind of main function I use at the bottom for your reference.
Once you have that code structure, use multiproc.py from NVIDIA to execute the code (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/multiproc.py). An alternative is the internal PyTorch launcher, i.e. python -m torch.distributed.launch, but I always prefer multiproc.py because it redirects the output of the other processes to separate files, avoiding the amalgamation of logs from different processes (otherwise each process prints the exact same output, so every print statement appears world_size times). Furthermore, multiproc.py handles interrupts across the different processes better.
It takes an --nnodes parameter, which should be set to 2, as well as --nproc_per_node, which should be set to 8. Since the two nodes need a way to communicate, designate one node as the master and note its IP address. Then pass that IP address via --master_addr ... on both nodes. The final command will look something like: python ./multiproc.py --nnodes 2 --nproc_per_node 8 --master_addr ... main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
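For reference, here is a rough sketch of what the two per-node commands could look like with the built-in launcher (multiproc.py accepts the same --nnodes/--node_rank/--nproc_per_node/--master_addr style of arguments). The IP 192.0.2.1 and port 29500 are placeholders, --node_rank must be 0 on the master and 1 on the other machine, and --use_env is added so the launcher exports LOCAL_RANK as an environment variable, which the main function below reads:
# on the master node (192.0.2.1 is a placeholder for its IP)
python -m torch.distributed.launch --use_env --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr 192.0.2.1 --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
# on the second node: the identical command, except --node_rank 1
python -m torch.distributed.launch --use_env --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr 192.0.2.1 --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024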
# Merged main() referenced above. Imports added for completeness; `parser` is the
# argparse parser already defined at the top of the repo's main.py.
import os
import sys

import torch

def main():
    args = parser.parse_args()
    args.ngpus_per_node = torch.cuda.device_count()

    # Initialize the distributed environment
    args.gpu = 0
    args.world_size = 1
    args.local_rank = 0
    args.distributed = int(os.getenv('WORLD_SIZE', 1)) > 1
    args.rank = int(os.getenv('RANK', 0))
    if "SLURM_NNODES" in os.environ:
        # Under Slurm, derive the local rank from the global rank.
        args.local_rank = args.rank % torch.cuda.device_count()
        print(f"SLURM tasks/nodes: {os.getenv('SLURM_NTASKS', 1)}/{os.getenv('SLURM_NNODES', 1)}")
    elif "WORLD_SIZE" in os.environ:
        # Under torch.distributed.launch / multiproc.py, the launcher provides LOCAL_RANK.
        args.local_rank = int(os.getenv('LOCAL_RANK', 0))

    args.gpu = args.local_rank
    torch.cuda.set_device(args.gpu)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    args.world_size = torch.distributed.get_world_size()
    assert int(os.getenv('WORLD_SIZE', 1)) == args.world_size
    print(f"Initializing the environment with {args.world_size} processes | Current process rank: {args.local_rank}")

    if args.rank == 0:
        # Only the first process creates the checkpoint directory and stats file.
        args.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        stats_file = open(args.checkpoint_dir / 'stats.txt', 'a', buffering=1)
        print(' '.join(sys.argv))
        print(' '.join(sys.argv), file=stats_file)

    gpu = args.gpu
    torch.backends.cudnn.benchmark = True
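From this point the rest of the original main_worker (model construction, optimizer, data loading, training loop) continues inside main() unchanged. A minimal sketch of the distributed-specific parts, assuming the BarlowTwins and Transform classes and the data/workers/batch-size arguments from the repo's main.py (everything else is illustrative, not the repo's exact code):
    # Continuation sketch: build the model, wrap it in DDP on this process's GPU,
    # and shard the dataset with a DistributedSampler so each process sees its own slice.
    import torchvision  # for ImageFolder; imported at the top of main.py in the repo

    model = BarlowTwins(args).cuda(args.gpu)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)   # sync BN statistics across all GPUs
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    dataset = torchvision.datasets.ImageFolder(args.data / 'train', Transform())
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    assert args.batch_size % args.world_size == 0
    per_device_batch_size = args.batch_size // args.world_size    # the global batch is split over all GPUs
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=per_device_batch_size, num_workers=args.workers,
        pin_memory=True, sampler=sampler)

    for epoch in range(args.epochs):
        sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
        # ... forward/backward and optimizer steps exactly as in the original main_worker ...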
@shoaibahmed Thanks a lot for your reply and detailed instructions! I will try it.
Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.
Me too. The largest batch size I was able to fit on 8 V100s (16 GB each) was 512, which used a little over 11 GB of memory per GPU.