Dockerfile for bytescheduler does not work
zarzen opened this issue
Describe the bug
Unable to run pytorch_horovod_benchmark.py in a Docker container built from the provided Dockerfile.
I get the following errors:
[1,1]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,1]<stderr>: timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,1]<stderr>: File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,1]<stderr>: return Timer(stmt, setup, timer).timeit(number)
[1,1]<stderr>: File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,1]<stderr>: timing = self.inner(it, self.timer)
[1,1]<stderr>: File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,1]<stderr>: _func()
[1,1]<stderr>: File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,1]<stderr>: output = model(data)
[1,1]<stderr>: File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>: result = self.forward(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,1]<stderr>: x = self.conv1(x)
[1,1]<stderr>: File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>: result = self.forward(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,1]<stderr>: self.padding, self.dilation, self.groups)
[1,1]<stderr>:RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,0]<stdout>:Model: resnet50
[1,0]<stdout>:Batch size: 32
[1,0]<stdout>:Number of GPUs: 4
[1,0]<stdout>:Running warmup...
(ranks 0, 2, and 3 then print the same THCudaCheck FAIL / RuntimeError traceback as rank 1 above)
To Reproduce
Steps to reproduce the behavior:
- Build the Docker image from the Dockerfile
- Run the container:
nvidia-docker run -it b9f6229bb5d5
- Run the benchmark script:
horovodrun -np 4 -H localhost:4 python pytorch_horovod_benchmark.py
- See the error
Expected behavior
Output the performance numbers.
Environment (please complete the following information):
- OS: Amazon Linux 2
- GCC version:
- CUDA and NCCL version: cuda-11.0
- Framework (TF, PyTorch, MXNet): PyTorch
Additional context
The issue might be due to an incompatible CUDA version outside the container.
Could you provide the environment setup for bytescheduler?
Also, are the *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder still needed to run PyTorch + Horovod? I did not see a compilation step for those files in the Dockerfile.
@zarzen The PyTorch version used in the Dockerfile is 1.0; I do not think it works well with CUDA 11.0. The *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder are needed for the torch/Horovod versions pinned in the Dockerfile.
@pengyanghua I think the Docker image ships its own CUDA 9; during the build I do see it pull some CUDA 9 base images. But I don't know why the nvidia-smi command inside the container shows version 11.0.
Does CUDA 10 work with PyTorch 1.0?
Does bytescheduler work with a more recent PyTorch, e.g. 1.8?
@zarzen The NVIDIA driver you see is likely mounted from the host, so nvidia-smi reports the host driver's CUDA version rather than the container's toolkit. If you cannot even run the basic Horovod example, the problem is most likely in your environment.
There are PyTorch 1.0 builds for CUDA 10; see https://pytorch.org/get-started/previous-versions/. We did not test bytescheduler with more recent PyTorch versions. The idea is similar, but you may need to modify some Horovod code to address version issues.
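The driver-versus-toolkit distinction above can be checked directly. A sketch, assuming the usual `/usr/local/cuda` layout inside the container (paths may differ in your image):

```shell
# The CUDA *toolkit* inside the container is what PyTorch was compiled against.
# (version.txt exists in older CUDA images; fall back to nvcc, then a notice.)
cat /usr/local/cuda/version.txt 2>/dev/null \
  || nvcc --version 2>/dev/null \
  || echo "CUDA toolkit not found in this environment"

# nvidia-smi reports the host *driver* and the highest CUDA version that
# driver supports -- not the toolkit the container ships.
nvidia-smi 2>/dev/null | head -n 4 || true
```

If the toolkit reports 9.x while nvidia-smi says 11.0, that is expected: the 11.0 figure is the host driver's capability, not the container's CUDA toolkit.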
Do you know which Amazon machine image can run the Docker container out of the box?
I tried Deep Learning AMI version 22.0, which has NVIDIA driver 10.2, and it did not work.
It would be great if you could provide an AMI ID.
Setting torch.backends.cudnn.benchmark = False solved the issue.
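For anyone hitting the same error, a minimal sketch of the workaround (place it near the top of pytorch_horovod_benchmark.py, before the model runs):

```python
import torch

# cuDNN benchmark mode autotunes convolution algorithms by timing them at
# runtime; on mismatched driver/toolkit setups this probing can fail with
# "cuda runtime error (11): invalid argument". Disabling it falls back to
# cuDNN's default heuristics, at a possible (usually small) speed cost.
torch.backends.cudnn.benchmark = False
```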