maciej-sypetkowski / kaggle-rcic-1st

1st Place Solution for Kaggle Recursion Cellular Image Classification Challenge -- https://www.kaggle.com/c/recursion-cellular-image-classification/


Found no NVIDIA driver on your system

WurmD opened this issue

commented

Hello,

After building the image and running it:

sudo docker build --tag testimage .
sudo docker run -t -i --privileged testimage bash
cd rcic/
python main.py --save testrun

We get:

	Traceback (most recent call last):
	  File "main.py", line 504, in <module>
		main(args)
	  File "main.py", line 484, in main
		model = ModelAndLoss(args).cuda()
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 297, in cuda
		return self._apply(lambda t: t.cuda(device))
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
		module._apply(fn)
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
		module._apply(fn)
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
		module._apply(fn)
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in _apply
		param_applied = fn(param)
	  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 297, in <lambda>
		return self._apply(lambda t: t.cuda(device))
	  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
		_check_driver()
	  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 99, in _check_driver
		http://www.nvidia.com/Download/index.aspx""")
	AssertionError: 
	Found no NVIDIA driver on your system. Please check that you
	have an NVIDIA GPU and installed a driver from
	http://www.nvidia.com/Download/index.aspx

Note that outside Docker, the GPU works as intended:

$ nvidia-smi
Sat Sep 26 17:26:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 33%   41C    P8    11W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Am I not running your code as intended?
What were the steps you took to run the code in Docker on your machine?

Add --gpus=all to your docker run command, or change docker run to nvidia-docker run; by default, Docker does not pass GPUs to the container. Once you do this, you should be able to run nvidia-smi inside the container and get the same output as outside it.
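For example, a minimal sketch of the run with GPUs enabled, reusing the testimage tag and the commands quoted above:

sudo docker run -t -i --gpus=all --privileged testimage bash
# inside the container, the driver should now be visible
nvidia-smi
cd rcic/
python main.py --save testrun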

commented

I confirm that installing nvidia-container-toolkit as per https://stackoverflow.com/a/58432877/1734357 and then adding --gpus=all to docker run resolves it.
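For reference, a sketch of that install on Ubuntu/Debian, following NVIDIA's nvidia-container-toolkit instructions from around that time (repository URLs and package names may differ for other distributions; see the linked answer for details):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker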