pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

Bus error (core dumped) model share memory

acrosson opened this issue

I'm getting a Bus error (core dumped) when using the share_memory method on a model.

OS: Ubuntu 16.04
It happens with both Python 2.7 and 3.5, in a conda environment and with a regular install. I'm using the latest version from http://pytorch.org/. I've also tried installing from source; same issue.

I tried doing a basic test using this code:

import torch.nn as nn
import torch.nn.functional as F  # needed below for relu, max_pool2d, dropout, log_softmax

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(2563*50, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x)

n = Net()
n.share_memory()
print('okay')

If the input size is small it works fine, but anything above some threshold throws the Bus error. It also works fine if I don't call share_memory().

I ran it with Python's trace module; here are the last few lines of the output.

module.py(391):             if module is not None and module not in memo:
module.py(392):                 memo.add(module)
module.py(393):                 yield name, module
module.py(378):             yield module
module.py(118):             module._apply(fn)
 --- modulename: module, funcname: _apply
module.py(117):         for module in self.children():
 --- modulename: module, funcname: children
module.py(377):         for name, module in self.named_children():
 --- modulename: module, funcname: named_children
module.py(389):         memo = set()
module.py(390):         for name, module in self._modules.items():
module.py(120):         for param in self._parameters.values():
module.py(121):             if param is not None:
module.py(124):                 param.data = fn(param.data)
 --- modulename: module, funcname: <lambda>
module.py(468):         return self._apply(lambda t: t.share_memory_())
 --- modulename: tensor, funcname: share_memory_
tensor.py(86):         self.storage().share_memory_()
 --- modulename: storage, funcname: share_memory_
storage.py(95):         from torch.multiprocessing import get_sharing_strategy
 --- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1006): <frozen importlib._bootstrap>(1007): <frozen importlib._bootstrap>(1012): <frozen importlib._bootstrap>(1013): <frozen importlib._bootstrap>(1012): <frozen importlib._bootstrap>(1025):
storage.py(96):         if self.is_cuda:
storage.py(98):         elif get_sharing_strategy() == 'file_system':
 --- modulename: __init__, funcname: get_sharing_strategy
__init__.py(59):     return _sharing_strategy
storage.py(101):             self._share_fd_()
Bus error (core dumped)

I tried running gdb, but it won't give me a full trace.

I've also tried creating a symbolic link to libgomp.so.1, since I suspected it might be a similar issue, but I still get the same error.

Any suggestions? This is running inside a Docker container, by the way.
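The trace ends inside _share_fd_(), which maps the parameter storage into shared memory backed by /dev/shm, so the "threshold" is presumably however much space is available there. A rough way to compare the two, as a sketch that reuses the Net class from the snippet above:

import shutil

net = Net()
# Total bytes the parameters would occupy once moved into shared memory.
param_bytes = sum(p.numel() * p.element_size() for p in net.parameters())
# Free space on /dev/shm, which backs the shared-memory segments.
shm_free = shutil.disk_usage('/dev/shm').free
print('parameters need ~{:.1f} MB, /dev/shm has {:.1f} MB free'.format(
    param_bytes / 1e6, shm_free / 1e6))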

Okay, I think I solved it. It looks like the shared memory of the Docker container wasn't set high enough. Raising it by adding --shm-size 8G to the docker run command seems to do the trick, as mentioned here. Let me fully test it; if it's solved, I'll close the issue.
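For reference, Docker's default /dev/shm is only 64 MB, which is easy to exceed once a model's parameters are moved into shared memory. The flag goes directly on docker run; something along these lines, with the image name and command as placeholders:

docker run --shm-size=8g my-pytorch-image python train.py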

Works fine now!

@acrosson Do you have experience with Google Cloud ML? Sorry to disturb you, but I got this error on a Cloud ML job with the machine type standard_gpu (NVIDIA Tesla K80 GPU, 30 GB memory).
How can I configure the --shm-size parameter on a Cloud ML job?

My config.yaml file:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 1
  parameterServerCount: 1
  parameterServerType: standard_gpu

@dneprDroid Did you figure out how to configure --shm-size on a Cloud ML Job?

Same problem here on Google Cloud. Any help would be greatly appreciated.

Any progress?

If you're using the mmdetection framework, set workers_per_gpu to 0:

cfg.data.workers_per_gpu = 0
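For people not using mmdetection: as far as I can tell, workers_per_gpu maps onto the DataLoader's num_workers, and worker processes hand batches back through shared memory, so the plain-PyTorch equivalent is to keep loading in the main process. A minimal sketch with placeholder data:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))  # placeholder data
# num_workers=0 loads batches in the main process, so no worker-to-main
# shared-memory transfers are involved.
loader = DataLoader(dataset, batch_size=16, num_workers=0)
for x, y in loader:
    pass

This sidesteps the error rather than raising the underlying /dev/shm limit, and it gives up data-loading parallelism.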

I can confirm that mounting a 1Gi emptyDir at /dev/shm in my Kubernetes container solved the SIGBUS for multi-GPU training with pytorch-lightning; see https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/

Originally posted by @ddelange in tloen/alpaca-lora#218 (comment)

For bigger models, this needs to be closer to 16Gi.
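For reference, the pod spec change is roughly the following; the container and volume names are placeholders, and medium: Memory is what makes the emptyDir tmpfs-backed like a regular /dev/shm:

spec:
  containers:
    - name: trainer
      image: my-pytorch-image
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi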