datamachines / cuda_tensorflow_opencv

Dockerfile with GPU support for TensorFlow and OpenCV


Problem making TensorFlow work with the GPU (for 10.2_2.1.0_4.3.0-20200423)

OkenKhuman opened this issue

First, thanks for helping me out last time.

While working with the "datamachines/cudnn_tensorflow_opencv:10.2_2.1.0_4.3.0-20200423" image, I have no problem enabling CUDA support, but when I try to use TensorFlow with the GPU it is unable to detect it: running "import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices('GPU')))" returns 0.
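
For completeness, this is the kind of check I run inside the container (a minimal sketch of my diagnostic; it only assumes the TensorFlow 2.x that ships in the image):

```python
# GPU-visibility check run inside the container (sketch).
import tensorflow as tf

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

gpus = tf.config.experimental.list_physical_devices('GPU')
print("GPUs visible:", len(gpus))
for gpu in gpus:
    print(" ", gpu)
```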

Is there a way to fix this, or do I need to download another image with CUDA 10.1?

Please help me out.

If possible, please also mention a way to install darknetpy in any of the images (I think it would be a very good enhancement for ML Docker images like this).

Hello Oken, sorry for the lag, I just saw this.
How are you running the container? Are you using docker run --gpus=all?
As for Darknet, I see that YOLOv4 is out; I was going to update a Dockerfile I had in order to build it. Maybe I will add it to an example directory when this is done.

Yes, I use the docker run --gpus=all command prefix. OpenCV's DNN module (GPU backend) and other GPU-accelerated packages like CuPy work well.
Only TensorFlow is not able to detect / use my GPU.
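
For reference, this is how I confirm that the non-TensorFlow GPU stack is healthy (a minimal sketch; it assumes an OpenCV build with its CUDA modules and an installed CuPy, as in the cudnn- images):

```python
# Sanity checks for the non-TensorFlow GPU stack (sketch; assumes OpenCV
# was built with its CUDA modules and that CuPy is installed).
import cv2
import cupy as cp

# Number of CUDA devices OpenCV's cuda module can see.
print("OpenCV CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())

# Number of CUDA devices visible to CuPy, plus a tiny on-GPU computation.
print("CuPy CUDA devices:", cp.cuda.runtime.getDeviceCount())
result = cp.ones((8, 8)) @ cp.ones((8, 8))
print("CuPy matmul OK:", bool(cp.allclose(result, 8.0)))
```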

tl;dr: still looking into it

long: I am still investigating the main issue, but TF requires CuDNN to work, so the "cuda" (non-CuDNN) variant will have to stay CPU-bound. While looking into it, it appears that the pip-installed TF is built against an older version of CUDA (10.0) and is hard-linked to those libraries, so I added some workarounds in the develop-linux branch as well as some tests (in the test directory) that run some simple TF code on CPU and GPU.
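
The kind of smoke test I mean is just a small computation pinned to each device; roughly (a sketch, not the actual test script in the repo):

```python
# Minimal CPU/GPU smoke test (sketch; not the repo's test script).
import tensorflow as tf

def matmul_on(device_name):
    """Run a small matmul on the given device and return the summed result."""
    with tf.device(device_name):
        a = tf.random.uniform((256, 256))
        b = tf.random.uniform((256, 256))
        return float(tf.reduce_sum(tf.matmul(a, b)))

print("CPU result:", matmul_on("/CPU:0"))
if tf.config.experimental.list_physical_devices('GPU'):
    print("GPU result:", matmul_on("/GPU:0"))
else:
    print("No GPU visible to TensorFlow")
```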

I have some preliminary content in the develop-linux branch that now builds TF from source.
TF needs the CuDNN base to compile the GPU-dependent part.

Yesterday I was able to fully download and use your "datamachines/cudnn_tensorflow_opencv:10.1_2.1.0_4.3.0" build. Hopefully the TF there works with the GPU :-).
Thanks again for this wonderful image; it is very helpful for an engineering student like me.
Also, if you have any paper based on this, I would like to cite it in the project I am working on. Or is it OK if I just give a reference to this repository?

Hi Oken,

I am currently building the "20200615" release, which will have TF built from source so that it makes use of the local CUDA and CuDNN. I would recommend waiting a couple more days before trying this version (I moved the compilation to a system with a lot more cores, and it is still taking a long time).

If you cannot wait for this release, I would encourage you to check out the develop-linux branch and compile the version that works best for you. On my gaming laptop (which is also what I was using before to compile TF), it takes 4-5 hours per build.

Another option is to run this script to load the CUDA 10.0 libraries so TF can use them, but this was more of a workaround than a proper solution; see:
e6d8d0c

In the test directory, you will see a few Python scripts whose names start with tf_; I would run those in the running container to see what the system sees.

Regarding a reference, feel free to cite the GitHub repository.

We also published an article that introduced this abstraction:
"Enabling GPU-Enhanced Computer Vision and Machine Learning Research Using Containers" (Dec 2019) High Performance Computing - ISC High Performance 2019 International Workshops, Lecture Notes in Computer Science Volume 11887
https://link.springer.com/chapter/10.1007/978-3-030-34356-9_8

I have committed to the develop-linux branch a refactoring of the Dockerfile which has so far successfully built all the cudnn- variants. I am waiting for all of them to compile before calling it a success and pushing the images as well.

Confirming that the 20200615 release will solve this (it is currently being pushed to Docker Hub).
Note that you will want to use the cudnn- variant to get GPU access.
Run test/tf_hw.py to obtain the list of functional hardware; during the verbose loading of the CUDA components you will see details about your GPU hardware, confirming it is present.
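
If you want an idea of what such a check reports, the following is roughly equivalent (a sketch only, not the actual test/tf_hw.py script):

```python
# Rough equivalent of a hardware-listing check (sketch; not the repo's
# test/tf_hw.py).
import tensorflow as tf
from tensorflow.python.client import device_lib

# High-level view: physical devices TensorFlow can use.
for dev in tf.config.experimental.list_physical_devices():
    print(dev.device_type, dev.name)

# Detailed view: triggers the verbose CUDA library loading and prints
# device descriptions (including the GPU model when one is present).
for dev in device_lib.list_local_devices():
    print(dev.name, "-", dev.physical_device_desc or "CPU")
```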

Closing this issue at this point.

20200615 is now released, and the pre-built images are available on Docker Hub.

Off-topic, but I just wanted to thank you for your hard work on what is contained in this repo. I've wasted way too much time over the last few years getting TF, OpenCV, and CUDA to play nicely together, and this repo means that I and others hopefully need to spend far less time doing so. So thank you!

You are quite welcome; I use this container very often for the same reason: I need a ready set of tools just to get some OpenCV code functional. Hopefully I will soon extend the Jetson Nano one for doing analytics at the edge :)

darknetpy would unfortunately not be a good solution to use with CTO, as it tries to compile YOLO itself.

PyYOLO (https://github.com/goktug97/PyYOLO), however, uses the already-installed OpenCV and libdarknet.so, and I have confirmed that it works by using their sample.py code; see https://github.com/datamachines/cuda_tensorflow_opencv#641-using-pyyolo
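
For a sense of what that looks like, here is a rough sketch modeled on PyYOLO's sample usage; the constructor arguments, file paths, and model files below are assumptions for illustration, so check the PyYOLO README for the exact API:

```python
# Rough sketch of PyYOLO usage against the image's preinstalled OpenCV and
# libdarknet.so. Paths, parameters, and model files are placeholders, and
# the exact pyyolo API may differ from this sketch (see the PyYOLO README).
import cv2
import pyyolo

# Hypothetical model files: point these at your own cfg/weights/data.
detector = pyyolo.YOLO("yolov3.cfg", "yolov3.weights", "coco.data",
                       detection_threshold=0.5,
                       hier_threshold=0.5,
                       nms_threshold=0.45)

frame = cv2.imread("input.jpg")
for det in detector.detect(frame):
    x_min, y_min, x_max, y_max = det.to_xyxy()
    cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), (0, 0, 255), 2)
cv2.imwrite("output.jpg", frame)
```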