clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition

Multi node training

ukemamaster opened this issue

Hi @joonson,
Could you please give some hints on making it work for multi-node, multi-GPU distributed training?

It will require many changes to the code, including but not limited to:

if args.distributed:
    # The current setup assumes a single node: the master address is the local
    # host, and the world size / rank only cover the GPUs on this machine.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = args.port

    dist.init_process_group(backend='nccl', world_size=ngpus_per_node, rank=args.gpu)

    torch.cuda.set_device(args.gpu)
    s.cuda(args.gpu)

    s = torch.nn.parallel.DistributedDataParallel(s, device_ids=[args.gpu], find_unused_parameters=True)

    print('Loaded the model on GPU {:d}'.format(args.gpu))
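For reference only, below is a minimal sketch of how the initialisation might look for a multi-node setup. The arguments nnodes, node_rank and master_addr are hypothetical additions and are not part of the current codebase; other parts of the training script (data sampling, logging, checkpointing) would also need to become rank-aware.

import os
import argparse
import torch
import torch.distributed as dist

# Hypothetical arguments -- not part of the repository's trainSpeakerNet.py.
parser = argparse.ArgumentParser()
parser.add_argument('--nnodes',      type=int, default=2,          help='Total number of nodes')
parser.add_argument('--node_rank',   type=int, default=0,          help='Rank of this node (0 .. nnodes-1)')
parser.add_argument('--master_addr', type=str, default='10.0.0.1', help='Address of the rank-0 node')
parser.add_argument('--port',        type=str, default='8888',     help='Port used for the rendezvous')
parser.add_argument('--gpu',         type=int, default=0,          help='Local GPU index on this node')
args = parser.parse_args()

ngpus_per_node = torch.cuda.device_count()

# The world size spans all nodes, and the global rank folds in the node rank.
world_size  = args.nnodes * ngpus_per_node
global_rank = args.node_rank * ngpus_per_node + args.gpu

# MASTER_ADDR must be reachable from every node, so 'localhost' no longer works.
os.environ['MASTER_ADDR'] = args.master_addr
os.environ['MASTER_PORT'] = args.port

dist.init_process_group(backend='nccl', world_size=world_size, rank=global_rank)
torch.cuda.set_device(args.gpu)

One process per GPU would then be launched on every node, each with the appropriate node_rank and local gpu index.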

We cannot provide support for this at this stage.