bytedance / byteps

A high performance and generic framework for distributed DNN training


docker file for bytescheduler does not work

zarzen opened this issue · comments

Describe the bug

Not able to run pytorch_horovod_benchmark.py in a Docker container built from the provided Dockerfile.

I get the following errors:

[1,1]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,1]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,1]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,1]<stderr>:    timing = self.inner(it, self.timer)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,1]<stderr>:    _func()
[1,1]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,1]<stderr>:    output = model(data)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>:    result = self.forward(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,1]<stderr>:    x = self.conv1(x)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>:    result = self.forward(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,1]<stderr>:    self.padding, self.dilation, self.groups)
[1,1]<stderr>:RuntimeError[1,1]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,0]<stdout>:Model: resnet50
[1,0]<stdout>:Batch size: 32
[1,0]<stdout>:Number of GPUs: 4
[1,0]<stdout>:Running warmup...
[1,0]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,0]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,0]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,0]<stderr>:    timing = self.inner(it, self.timer)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,0]<stderr>:    _func()
[1,0]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,0]<stderr>:    output = model(data)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,0]<stderr>:    x = self.conv1(x)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,0]<stderr>:    self.padding, self.dilation, self.groups)
[1,0]<stderr>:RuntimeError[1,0]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,3]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,3]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,3]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,3]<stderr>:    timing = self.inner(it, self.timer)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,3]<stderr>:    _func()
[1,3]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,3]<stderr>:    output = model(data)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,3]<stderr>:    result = self.forward(*input, **kwargs)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,3]<stderr>:    x = self.conv1(x)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,3]<stderr>:    result = self.forward(*input, **kwargs)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,3]<stderr>:    self.padding, self.dilation, self.groups)
[1,3]<stderr>:RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,2]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,2]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,2]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,2]<stderr>:    timing = self.inner(it, self.timer)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,2]<stderr>:    _func()
[1,2]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,2]<stderr>:    output = model(data)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,2]<stderr>:    result = self.forward(*input, **kwargs)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,2]<stderr>:    x = self.conv1(x)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,2]<stderr>:    result = self.forward(*input, **kwargs)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,2]<stderr>:    self.padding, self.dilation, self.groups)
[1,2]<stderr>:RuntimeError[1,2]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

To Reproduce
Steps to reproduce the behavior:

  1. Build the Docker image from the Dockerfile
  2. Run the container with nvidia-docker run -it b9f6229bb5d5
  3. Run the benchmark script: horovodrun -np 4 -H localhost:4 python pytorch_horovod_benchmark.py
  4. See the error

Expected behavior
Output the performance numbers.

Environment (please complete the following information):

  • OS: Amazon Linux 2
  • GCC version:
  • CUDA and NCCL version: cuda-11.0
  • Framework (TF, PyTorch, MXNet): PyTorch

Additional context
The issue might be due to an incompatible CUDA version outside of the container.
Could you provide the environment setup for bytescheduler?

BTW, are the *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder still needed for PyTorch + Horovod to run?
I didn't see a compilation step for those files in the Dockerfile.

@zarzen The PyTorch version used in the Dockerfile is 1.0; I do not think it works well with CUDA 11.0. The *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder are used by the torch/Horovod versions in the Dockerfile.

@pengyanghua I think the Docker image has its own CUDA 9. During the image build I do see it pull base images with CUDA 9, but I don't know why the nvidia-smi command inside the container shows version 11.0.

Does CUDA 10 work with PyTorch 1.0?
Does bytescheduler work with a more recent PyTorch, e.g. 1.8?

@zarzen The NVIDIA driver you see may be mounted from the host. If you cannot even run the basic Horovod example, then there is likely a problem with your environment.

There is a PyTorch 1.0 build for CUDA 10; check https://pytorch.org/get-started/previous-versions/. We did not test bytescheduler with more recent PyTorch versions. I think the idea is similar, but you may need to modify some Horovod code to address version issues.
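For reference, inside an nvidia-docker container the "CUDA Version" printed by nvidia-smi comes from the host driver (it is the highest CUDA version that driver supports), not from the toolkit installed in the image; a newer host driver remains backward compatible with the older toolkit inside the container. A quick sketch (not part of the repo) to see which CUDA toolkit the container's PyTorch wheel was actually built against:

# Sketch: print the CUDA/cuDNN versions this PyTorch build was compiled with,
# independent of what the host's nvidia-smi reports.
import torch

print("torch:", torch.__version__)             # the Dockerfile pins PyTorch 1.0
print("built with CUDA:", torch.version.cuda)  # toolkit the wheel was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

If torch.version.cuda reports 9.x while nvidia-smi shows 11.0, that by itself only means the host driver is newer than the image's toolkit.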

@zarzen The NVIDIA driver you see may be mounted from the host. If you cannot even run the basic Horovod example, then there is likely a problem with your environment.

Do you know which Amazon image can run the Docker container out of the box?
I have tried Deep Learning AMI version 22.0, which has NVIDIA driver 10.2. It did not work.
It would be great if you could provide an AMI ID.

Setting cudnn.benchmark = False solved the issue.
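For anyone hitting the same traceback: the workaround above turns off cuDNN's convolution-algorithm autotuner. A minimal sketch, assuming the flag is set near the top of pytorch_horovod_benchmark.py before any forward pass (the Horovod benchmark scripts typically enable it, so flipping the existing line is enough):

# Sketch of the reported fix: disable cuDNN autotuning. With benchmark=True,
# cuDNN times several convolution algorithms per input shape; that search is
# what appears to fail with "invalid argument" on this driver/toolkit combo.
import torch.backends.cudnn as cudnn

cudnn.benchmark = False

The trade-off is potentially slower convolutions, since cuDNN falls back to its default algorithm choice instead of benchmarking candidates for the fixed input size used in the benchmark.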