dougsouza / pytorch-sync-batchnorm-example

How to use Cross Replica / Synchronized Batchnorm in Pytorch


can you give a complete example?

yzl96 opened this issue · comments

commented

For example, I want to use GPU 2, GPU 4 and GPU 5 to train the net. How can I set that up?

@yzl96, you can set the `CUDA_VISIBLE_DEVICES` environment variable so that only the GPUs you want are visible to the process. In your case, you could do the following:

$ export CUDA_VISIBLE_DEVICES=2,4,5
$ python your_train_script.py

Inside your train script, the available GPUs will be 0, 1, 2, because those are the only devices visible to the process.
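If you want to double-check the remapping inside the script, a quick sanity check could look like this (just a sketch; it assumes the export above happened before Python started):

```python
import torch

# With CUDA_VISIBLE_DEVICES=2,4,5 the process sees exactly three devices,
# re-indexed as cuda:0, cuda:1, cuda:2 (physical GPUs 2, 4 and 5).
print(torch.cuda.device_count())   # prints 3
device = torch.device('cuda:0')    # this is physical GPU 2
```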

commented

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import numpy as np
import os
# import argparse

batch_size = 64
# os.environ['CUDA_VISIBLE_DEVICES'] = '3,4,5'
# parser.add_argument('--local_rank', type=int, default=3)

device = torch.device('cuda:{}'.format(3))

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    train=True,
    download=True,
    transform=transforms.Compose([transforms.ToTensor()])
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)

test_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    train=False,
    download=True,
    transform=transforms.Compose([transforms.ToTensor()])
)

test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=True)


class Lenet(nn.Module):
    def __init__(self):
        super(Lenet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=1)

        self.fc1 = nn.Linear(in_features=32, out_features=16, bias=True)
        self.fc2 = nn.Linear(in_features=16, out_features=10, bias=True)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = x.view(x.size(0), -1)  # flatten before the fully connected layers
        x = self.fc1(x)
        x = self.fc2(x)

        return x


net = Lenet()

net = net.to(device)

# 3 GPUs -> 3 processes
world_size = 3
torch.distributed.init_process_group(
    'nccl',
    init_method='env://',
    world_size=world_size,
    rank=3,
)

# convert BatchNorm layers to SyncBatchNorm and wrap the model in DDP
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)

net = torch.nn.parallel.DistributedDataParallel(
    net,
    device_ids=[3],
    output_device=3,
)

# each process gets its own shard of the training set
sampler = torch.utils.data.distributed.DistributedSampler(
    train_set,
    num_replicas=3,
    rank=3,
)
data_loader = DataLoader(
    train_set,
    batch_size=batch_size,
    num_workers=8,
    pin_memory=True,
    sampler=sampler,
    drop_last=True,
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

for it, (inpu, target) in enumerate(data_loader):
    inpu, target = inpu.to(device), target.to(device)
    optimizer.zero_grad()
    outputs = net(inpu)
    loss = criterion(outputs, target)
    loss.backward()
    optimizer.step()
```

@dougsouza, I set `export CUDA_VISIBLE_DEVICES=3,4,5` and, following your markdown, wrote the code above. The terminal output is below (each of the three processes prints the same traceback):

```
  File "multi_gpu.py", line 62, in <module>
    net = net.to(device)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
```

My GPUs 3, 4 and 5 are available, and I set local_rank to 3.

@yzl96, when you set `CUDA_VISIBLE_DEVICES=3,4,5` you're saying there are 3 GPUs available. Your program therefore sees 3 GPUs, indexed 0, 1, 2, so index 3 is out of bounds.

commented

It seems like all three processes use GPU 3. How do I assign a different GPU to each process? I see lots of people use `multiprocessing.spawn`; what should I do to give each process its own GPU?

commented

OK, I made a mistake. So 0 means GPU 3?

@yzl96, yes, index 0 most likely maps to GPU 3.

About launching the processes, you are correct. There are two ways to launch them: using `multiprocessing.spawn` or using `torch.distributed.launch`. The example here uses the second approach. In that case you need to parse the `local_rank` argument and then use it to send your model and data to that process's device. For example, you could do the following:

$ export CUDA_VISIBLE_DEVICES=2,4,5
$ python -m torch.distributed.launch --nproc_per_node=3 your_train_script.py \
--arg1=arg1 --arg2=arg2 --arg3=arg3 --arg4=arg4 --argn=argn

In this case, your your_train_script.py will be executed 3 times, each time receiving a different local_rank. The first process launched will be rank 0, the second rank 1 and the third rank 2.
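To make that concrete, here is a rough sketch of what the top of your_train_script.py could look like (the `--local_rank` argument is what `torch.distributed.launch` passes to each process; everything else is just a placeholder):

```python
import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to every process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its own GPU: after CUDA_VISIBLE_DEVICES=2,4,5 the
# visible devices are indexed 0, 1, 2, matching local_rank.
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda:{}'.format(args.local_rank))

# The launcher also sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so env:// initialization needs no explicit world_size/rank here.
dist.init_process_group(backend='nccl', init_method='env://')
```

Model and data then go to `device` (or simply `.cuda()` once `set_device` has been called).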

commented

OK, thanks. I have to go to class; I will give it a try later.

commented

@dougsouza It works now, thanks for your help, bro.

commented

@dougsouza, the code works, but there is something I'm confused about: GPU 0 occupies more memory than the other GPUs. When I use 3 GPUs, GPU 0 uses three times the memory of GPU 1 and GPU 2; when I use 2 GPUs, GPU 0 uses twice the memory of GPU 1.

@yzl96, you need to check in your code whether you're sending the model and data to the correct GPU in each process. I also noticed that GPU 0 (the master) uses more memory. I don't know yet whether that's normal behaviour (it makes sense, since some state is shared between the processes) or a bug in my code.

commented

@dougsouza, I just added a line, `print(len(data_loader))`. With 3 GPUs I have 300 batches and each process prints 100, so I think the data split is correct. Can I ask how much more memory your GPU 0 occupies compared to the other GPUs?

@yzl96, I don't have an exact number. Processes 1 and 2 allocate some memory on GPU 0, but it is not much, maybe around 200 MiB. I guess the amount depends on the model size.

commented

I fixed the problem and it works fine now. I just changed `device = torch.device('cuda:{}'.format(args.local_rank))` to `torch.cuda.set_device(args.local_rank)`, and changed `input = input.to(device)` to `input = input.cuda()`.
Now the two GPUs occupy exactly the same memory.
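Roughly, the change looks like this (a sketch using the same variable names as the code above):

```python
# Instead of building an explicit device object and calling .to(device),
# make this process's GPU the current CUDA device once, then use .cuda():
torch.cuda.set_device(args.local_rank)

net = net.cuda()
for it, (input, target) in enumerate(data_loader):
    input, target = input.cuda(), target.cuda()
    # ... forward / backward / step as before
```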

@yzl96, thanks for sharing. Will test and update the docs.

Hi, can you give me the final version of the code above? I tried it but it didn't work.