clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition

Multi node training

ukemamaster opened this issue

Hi @joonson,
Could you please give some hints on making it work for multi-node, multi-GPU distributed training?

It will require many changes to the code, including but not limited to:

if args.distributed:
    # The current setup assumes a single node: the master address is the local
    # host, and the world size / rank only cover the GPUs on this machine.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = args.port

    dist.init_process_group(backend='nccl', world_size=ngpus_per_node, rank=args.gpu)

    torch.cuda.set_device(args.gpu)
    s.cuda(args.gpu)

    s = torch.nn.parallel.DistributedDataParallel(s, device_ids=[args.gpu], find_unused_parameters=True)

    print('Loaded the model on GPU {:d}'.format(args.gpu))
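For reference only, below is a minimal sketch of how the initialisation might look for a multi-node setup. The arguments nnodes, node_rank and master_addr are hypothetical additions and are not part of the current codebase; other parts of the training script (data sampling, logging, checkpointing) would also need to become rank-aware.

import os
import argparse
import torch
import torch.distributed as dist

# Hypothetical arguments -- not part of the repository's trainSpeakerNet.py.
parser = argparse.ArgumentParser()
parser.add_argument('--nnodes',      type=int, default=2,          help='Total number of nodes')
parser.add_argument('--node_rank',   type=int, default=0,          help='Rank of this node (0 .. nnodes-1)')
parser.add_argument('--master_addr', type=str, default='10.0.0.1', help='Address of the rank-0 node')
parser.add_argument('--port',        type=str, default='8888',     help='Port used for the rendezvous')
parser.add_argument('--gpu',         type=int, default=0,          help='Local GPU index on this node')
args = parser.parse_args()

ngpus_per_node = torch.cuda.device_count()

# The world size spans all nodes, and the global rank folds in the node rank.
world_size  = args.nnodes * ngpus_per_node
global_rank = args.node_rank * ngpus_per_node + args.gpu

# MASTER_ADDR must be reachable from every node, so 'localhost' no longer works.
os.environ['MASTER_ADDR'] = args.master_addr
os.environ['MASTER_PORT'] = args.port

dist.init_process_group(backend='nccl', world_size=world_size, rank=global_rank)
torch.cuda.set_device(args.gpu)

One process per GPU would then be launched on every node, each with the appropriate node_rank and local gpu index.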

We cannot provide support for this at this stage.