dougsouza / pytorch-sync-batchnorm-example

How to use Cross Replica / Synchronized Batchnorm in Pytorch


can you give a complete example?

yzl96 opened this issue · comments

commented

For example, I want to use GPU 2, GPU 4 and GPU 5 to train the net. How can I set that up?

@yzl96, you can set the `CUDA_VISIBLE_DEVICES` environment variable so that only the GPUs you want are visible to the process. In your case, you could do the following:

$ export CUDA_VISIBLE_DEVICES=2,4,5
$ python your_train_script.py

Inside your train script, the available GPUs will be 0, 1, 2, because those are the only devices visible to the process.
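If you want to double-check the remapping inside the script, a quick sanity check could look like this (just a sketch; it assumes the export above happened before Python started):

```python
import torch

# With CUDA_VISIBLE_DEVICES=2,4,5 the process sees exactly three devices,
# re-indexed as cuda:0, cuda:1, cuda:2 (physical GPUs 2, 4 and 5).
print(torch.cuda.device_count())   # prints 3
device = torch.device('cuda:0')    # this is physical GPU 2
```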

commented

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import numpy as np
import os
# import argparse

batch_size = 64
# os.environ['CUDA_VISIBLE_DEVICES'] = '3,4,5'
# parser.add_argument('--local_rank', type=int, default=3)

device = torch.device('cuda:{}'.format(3))

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    train=True,
    download=True,
    transform=transforms.Compose([transforms.ToTensor()])
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)

test_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    train=False,
    download=True,
    transform=transforms.Compose([transforms.ToTensor()])
)

test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=True)


class Lenet(nn.Module):
    def __init__(self):
        super(Lenet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=1)

        self.fc1 = nn.Linear(in_features=32, out_features=16, bias=True)
        self.fc2 = nn.Linear(in_features=16, out_features=10, bias=True)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = x.view(x.size(0), -1)  # flatten before the fully connected layers
        x = self.fc1(x)
        x = self.fc2(x)

        return x


net = Lenet()

net = net.to(device)

# 3 GPUs -> 3 processes
world_size = 3
torch.distributed.init_process_group(
    'nccl',
    init_method='env://',
    world_size=world_size,
    rank=3,
)

# convert BatchNorm layers to SyncBatchNorm and wrap the model in DDP
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)

net = torch.nn.parallel.DistributedDataParallel(
    net,
    device_ids=[3],
    output_device=3,
)

# each process gets its own shard of the training set
sampler = torch.utils.data.distributed.DistributedSampler(
    train_set,
    num_replicas=3,
    rank=3,
)
data_loader = DataLoader(
    train_set,
    batch_size=batch_size,
    num_workers=8,
    pin_memory=True,
    sampler=sampler,
    drop_last=True,
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

for it, (inpu, target) in enumerate(data_loader):
    inpu, target = inpu.to(device), target.to(device)
    optimizer.zero_grad()
    outputs = net(inpu)
    loss = criterion(outputs, target)
    loss.backward()
    optimizer.step()
```

@dougsouza, I set `export CUDA_VISIBLE_DEVICES=3,4,5` and, following your markdown, wrote the code above. The terminal output is below (each of the three processes prints the same traceback):

```
  File "multi_gpu.py", line 62, in <module>
    net = net.to(device)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/root/anaconda3/envs/torch27/lib/python2.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
```

My GPUs 3, 4 and 5 are available, and I set local_rank to 3.

@yzl96, when you set `CUDA_VISIBLE_DEVICES=3,4,5` you're saying there are 3 GPUs available. Your program therefore sees 3 GPUs, indexed 0, 1, 2, so index 3 is out of bounds.

commented

It seems like all three processes use GPU 3. How do I assign a different GPU to each process? I see lots of people use `multiprocessing.spawn`; what should I do to give each process its own GPU?

commented

OK, I made a mistake. So 0 means GPU 3?

@yzl96, yes, index 0 most likely maps to GPU 3.

About launching the processes, you are correct. There are two ways to launch them: using `multiprocessing.spawn` or using `torch.distributed.launch`. The example here uses the second approach. In that case you need to parse the `local_rank` argument and then use it to send your model and data to that process's device. For example, you could do the following:

$ export CUDA_VISIBLE_DEVICES=2,4,5
$ python -m torch.distributed.launch --nproc_per_node=3 your_train_script.py \
--arg1=arg1 --arg2=arg2 --arg3=arg3 --arg4=arg4 --argn=argn

In this case, your your_train_script.py will be executed 3 times, each time receiving a different local_rank. The first process launched will be rank 0, the second rank 1 and the third rank 2.
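To make that concrete, here is a rough sketch of what the top of your_train_script.py could look like (the `--local_rank` argument is what `torch.distributed.launch` passes to each process; everything else is just a placeholder):

```python
import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to every process it spawns.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its own GPU: after CUDA_VISIBLE_DEVICES=2,4,5 the
# visible devices are indexed 0, 1, 2, matching local_rank.
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda:{}'.format(args.local_rank))

# The launcher also sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so env:// initialization needs no explicit world_size/rank here.
dist.init_process_group(backend='nccl', init_method='env://')
```

Model and data then go to `device` (or simply `.cuda()` once `set_device` has been called).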

commented

OK, thanks. I have to go to class; I will give it a try later.

commented

@dougsouza It works now, thanks for your help, bro.

commented

@dougsouza, the code works, but there is something I'm confused about: GPU 0 occupies more memory than the other GPUs. When I use 3 GPUs, GPU 0 uses three times the memory of GPU 1 and GPU 2; when I use 2 GPUs, GPU 0 uses twice the memory of GPU 1.

@yzl96, you need to check in your code whether you're sending the model and data to the correct GPU in each process. I also noticed that GPU 0 (the master) uses more memory. I don't know yet whether that's normal behaviour (it makes sense, since some state is shared between the processes) or a bug in my code.

commented

@dougsouza, I just added a line, `print(len(data_loader))`. With 3 GPUs I have 300 batches and each process prints 100, so I think the data split is correct. Can I ask how much more memory your GPU 0 occupies compared to the other GPUs?

@yzl96, I don't have an exact number. Processes 1 and 2 allocate some memory on GPU 0, but it is not much, maybe around 200 MiB. I guess the amount depends on the model size.

commented

I fixed the problem and it works fine now. I just changed `device = torch.device('cuda:{}'.format(args.local_rank))` to `torch.cuda.set_device(args.local_rank)`, and changed `input = input.to(device)` to `input = input.cuda()`.
Now the two GPUs occupy exactly the same memory.
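Roughly, the change looks like this (a sketch using the same variable names as the code above):

```python
# Instead of building an explicit device object and calling .to(device),
# make this process's GPU the current CUDA device once, then use .cuda():
torch.cuda.set_device(args.local_rank)

net = net.cuda()
for it, (input, target) in enumerate(data_loader):
    input, target = input.cuda(), target.cuda()
    # ... forward / backward / step as before
```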

@yzl96, thanks for sharing. Will test and update the docs.

Hi, can you give me the final version of the code above? I tried it but it didn't work.