bytedance / byteps

A high performance and generic framework for distributed DNN training


docker file for bytescheduler does not work

zarzen opened this issue · comments

Describe the bug

Not able to run pytorch_horovod_benchmark.py in a Docker container built from the provided Dockerfile.

I get the following errors:

[1,1]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,1]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,1]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,1]<stderr>:    timing = self.inner(it, self.timer)
[1,1]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,1]<stderr>:    _func()
[1,1]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,1]<stderr>:    output = model(data)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>:    result = self.forward(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,1]<stderr>:    x = self.conv1(x)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,1]<stderr>:    result = self.forward(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,1]<stderr>:    self.padding, self.dilation, self.groups)
[1,1]<stderr>:RuntimeError[1,1]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,0]<stdout>:Model: resnet50
[1,0]<stdout>:Batch size: 32
[1,0]<stdout>:Number of GPUs: 4
[1,0]<stdout>:Running warmup...
[1,0]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,0]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,0]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,0]<stderr>:    timing = self.inner(it, self.timer)
[1,0]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,0]<stderr>:    _func()
[1,0]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,0]<stderr>:    output = model(data)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,0]<stderr>:    x = self.conv1(x)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,0]<stderr>:    self.padding, self.dilation, self.groups)
[1,0]<stderr>:RuntimeError[1,0]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,3]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,3]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,3]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,3]<stderr>:    timing = self.inner(it, self.timer)
[1,3]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,3]<stderr>:    _func()
[1,3]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,3]<stderr>:    output = model(data)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,3]<stderr>:    result = self.forward(*input, **kwargs)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,3]<stderr>:    x = self.conv1(x)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,3]<stderr>:    result = self.forward(*input, **kwargs)
[1,3]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,3]<stderr>:    self.padding, self.dilation, self.groups)
[1,3]<stderr>:RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
[1,2]<stderr>:THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "pytorch_horovod_benchmark.py", line 119, in <module>
[1,2]<stderr>:    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 237, in timeit
[1,2]<stderr>:    return Timer(stmt, setup, timer).timeit(number)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 202, in timeit
[1,2]<stderr>:    timing = self.inner(it, self.timer)
[1,2]<stderr>:  File "/usr/lib/python2.7/timeit.py", line 100, in inner
[1,2]<stderr>:    _func()
[1,2]<stderr>:  File "pytorch_horovod_benchmark.py", line 100, in benchmark_step
[1,2]<stderr>:    output = model(data)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,2]<stderr>:    result = self.forward(*input, **kwargs)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torchvision/models/resnet.py", line 150, in forward
[1,2]<stderr>:    x = self.conv1(x)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
[1,2]<stderr>:    result = self.forward(*input, **kwargs)
[1,2]<stderr>:  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 320, in forward
[1,2]<stderr>:    self.padding, self.dilation, self.groups)
[1,2]<stderr>:RuntimeError[1,2]<stderr>:: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

To Reproduce
Steps to reproduce the behavior:

  1. Build the Docker image from the Dockerfile
  2. Run the container with nvidia-docker run -it b9f6229bb5d5
  3. Run the benchmark script: horovodrun -np 4 -H localhost:4 python pytorch_horovod_benchmark.py
  4. See the error

Expected behavior
Output the performance numbers.

Environment (please complete the following information):

  • OS: Amazon Linux 2
  • GCC version:
  • CUDA and NCCL version: cuda-11.0
  • Framework (TF, PyTorch, MXNet): PyTorch

Additional context
The issue might be due to an incompatible CUDA version outside of the container.
Could you provide the environment setup for bytescheduler?

BTW, are the *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder still needed for PyTorch + Horovod to run?
I didn't see a compilation step for those files in the Dockerfile.

@zarzen The PyTorch version used in the Dockerfile is 1.0; I do not think it works well with CUDA 11.0. The *.cc and *.h files in the bytescheduler/bytescheduler/pytorch/ folder are used by the torch/Horovod versions in the Dockerfile.

@pengyanghua I think the Docker image has its own CUDA 9. During the image build I do see it pull base images with CUDA 9, but I don't know why the nvidia-smi command inside the container shows version 11.0.

Does CUDA 10 work with PyTorch 1.0?
Does bytescheduler work with a more recent PyTorch, e.g. 1.8?

@zarzen The NVIDIA driver you see may be mounted from the host. If you cannot even run the basic Horovod example, then there is likely a problem with your environment.

There is a PyTorch 1.0 build for CUDA 10; check https://pytorch.org/get-started/previous-versions/. We did not test bytescheduler with more recent PyTorch versions. I think the idea is similar, but you may need to modify some Horovod code to address version issues.
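For reference, inside an nvidia-docker container the "CUDA Version" printed by nvidia-smi comes from the host driver (it is the highest CUDA version that driver supports), not from the toolkit installed in the image; a newer host driver remains backward compatible with the older toolkit inside the container. A quick sketch (not part of the repo) to see which CUDA toolkit the container's PyTorch wheel was actually built against:

# Sketch: print the CUDA/cuDNN versions this PyTorch build was compiled with,
# independent of what the host's nvidia-smi reports.
import torch

print("torch:", torch.__version__)             # the Dockerfile pins PyTorch 1.0
print("built with CUDA:", torch.version.cuda)  # toolkit the wheel was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

If torch.version.cuda reports 9.x while nvidia-smi shows 11.0, that by itself only means the host driver is newer than the image's toolkit.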

@zarzen The NVIDIA driver you see may be mounted from the host. If you cannot even run the basic Horovod example, then there is likely a problem with your environment.

Do you know which Amazon image can run the Docker container out of the box?
I have tried Deep Learning AMI version 22.0, which has NVIDIA driver 10.2. It did not work.
It would be great if you could provide an AMI ID.

Setting cudnn.benchmark = False solved the issue.
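For anyone hitting the same traceback: the workaround above turns off cuDNN's convolution-algorithm autotuner. A minimal sketch, assuming the flag is set near the top of pytorch_horovod_benchmark.py before any forward pass (the Horovod benchmark scripts typically enable it, so flipping the existing line is enough):

# Sketch of the reported fix: disable cuDNN autotuning. With benchmark=True,
# cuDNN times several convolution algorithms per input shape; that search is
# what appears to fail with "invalid argument" on this driver/toolkit combo.
import torch.backends.cudnn as cudnn

cudnn.benchmark = False

The trade-off is potentially slower convolutions, since cuDNN falls back to its default algorithm choice instead of benchmarking candidates for the fixed input size used in the benchmark.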