tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone

Home Page: https://tensorflow.org

"Cannot dlopen some GPU libraries." does not List what Libraries Failed to Load

stellarpower opened this issue · comments

Issue type

Feature Request

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf-nightly 2.17.0.dev20240504

Custom code

No

OS platform and distribution

Ubuntu Jammy

Mobile device

No response

Python version

3.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I have installed tf-nightly from the official PYPI package, like so:

pip install tf-nightly[and-cuda]   

When I load TensorFlow, it isn't seeing my GPU, and I get the message:

Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

I normally use the conda-forge packages, in part precisely because it should handle some of these things for me so I don't have to worry. But I saw pip installing a large number of CUDA libraries during the process, so I'd expect most of what I need to be there.
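
For reference, the check that triggers the message is nothing more exotic than importing TensorFlow and listing the visible GPUs; roughly:

# Minimal check: with the dlopen failure above, the warning is printed when the
# GPU devices are first enumerated, and this returns an empty list.
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))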

The function MaybeTryDlopenGPULibraries() is responsible for attempting to load the required libraries at runtime; however, it doesn't tell me which libraries it tried to find, what search path it was using, etc. As I've already followed the steps in the guide at that URL, it's not the most helpful diagnostic message without further information.

Whilst short, and therefore not cluttering the screen (which may be fine for many situations), the message isn't much help in working out what the problem actually is. Library search paths on modern, complex systems can be finicky to pin down, so if it isn't the default behaviour, I'd at least like a flag or environment variable I can set to see which library loads were attempted, which succeeded (and from which path), and which were missing, alongside other debugging output. If the short form of the message is kept as the default, it would be good for it to print how to set that option, so that I can go round again and get more verbose output.
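
In the meantime, a rough way to approximate that output from the user side is to try dlopening the usual CUDA libraries directly and report what loads. A sketch along these lines (the sonames below are guesses for a CUDA 12.x install, not taken from TensorFlow's own loader code):

# Try to dlopen the CUDA libraries TensorFlow typically needs and report which
# ones load. The soname list is an assumption for a CUDA 12.x setup.
import ctypes

candidates = [
    "libcudart.so.12",
    "libcublas.so.12",
    "libcublasLt.so.12",
    "libcufft.so.11",
    "libcusolver.so.11",
    "libcusparse.so.12",
    "libcudnn.so.8",
    "libnccl.so.2",
]

for name in candidates:
    try:
        ctypes.CDLL(name)
        print(f"OK      {name}")
    except OSError as e:
        print(f"FAILED  {name}: {e}")

It isn't a substitute for TensorFlow reporting the exact list it tried, but it at least surfaces which libraries the dynamic loader can and cannot find on the current search path.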

Thanks

Standalone code to reproduce the issue

See above.

Relevant log output

No response

commented

Hi @stellarpower ,

You need to install the GPU driver manually. After that, you need to set LD_LIBRARY_PATH to the path where the NVIDIA libraries are installed. You may refer to this comment. Please refer to #63362 for more details. Thanks
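
If the libraries came in via the [and-cuda] pip extras, their lib directories live under the nvidia namespace package in site-packages; a quick way to print them (assuming the nvidia-*-cu12 wheels are installed) is something like:

# Print the lib/ directories of the pip-installed NVIDIA wheels so they can be
# added to LD_LIBRARY_PATH. Assumes the nvidia-*-cu12 packages pulled in by
# "pip install tf-nightly[and-cuda]" are present.
import glob
import os
import nvidia  # namespace package provided by the nvidia-*-cu12 wheels

for path in nvidia.__path__:
    for lib_dir in sorted(glob.glob(os.path.join(path, "*", "lib"))):
        print(lib_dir)

The printed directories can then be prepended to LD_LIBRARY_PATH before launching Python.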

Thanks; I had done all this previously.

But I have opened as an issue irrespective of my own setup, because I believe it should be possible to get more information from the error message. Without knowing what libraries failed to be opened, just re-installing and following the instructions again isn't a particularly efficient way to debug what happened.

Yes, I agree. I am running into the same issue now. This is particularly frustrating because of the arcane versioning of CUDA-related toolsets (i.e. the Python packages vs. CUDA vs. the dependency matrix in the documentation). For example:

  • TensorFlow documentation lists the correct CUDA version as 11.8, so I installed that and updated my $PATH, $LD_LIBRARY_PATH, etc. accordingly (along with cuDNN 8.6 as listed).
  • When I use a fresh Python 3.10 installation to install tensorflow[and-cuda] via Pip, it seems to be defaulting to CUDA runtime 12?
Collecting nvidia-cublas-cu12==12.3.4.1
  Using cached nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (412.6 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.3.107
  Using cached nvidia_cuda_nvrtc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (24.9 MB)
Collecting nvidia-curand-cu12==10.3.4.107
  Using cached nvidia_curand_cu12-10.3.4.107-py3-none-manylinux1_x86_64.whl (56.3 MB)
Collecting nvidia-cusparse-cu12==12.2.0.103
  Using cached nvidia_cusparse_cu12-12.2.0.103-py3-none-manylinux1_x86_64.whl (197.5 MB)
Collecting nvidia-nvjitlink-cu12==12.3.101
  Using cached nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)
Collecting nvidia-nccl-cu12==2.19.3
  Using cached nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
Collecting nvidia-cuda-nvcc-cu12==12.3.107
  Using cached nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (22.0 MB)
Collecting nvidia-cusolver-cu12==11.5.4.101
  Using cached nvidia_cusolver_cu12-11.5.4.101-py3-none-manylinux1_x86_64.whl (125.2 MB)
Collecting nvidia-cudnn-cu12==8.9.7.29
  Using cached nvidia_cudnn_cu12-8.9.7.29-py3-none-manylinux1_x86_64.whl (704.7 MB)
Collecting nvidia-cufft-cu12==11.0.12.1
  Using cached nvidia_cufft_cu12-11.0.12.1-py3-none-manylinux1_x86_64.whl (98.8 MB)
Collecting nvidia-cuda-cupti-cu12==12.3.101
  Using cached nvidia_cuda_cupti_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (14.0 MB)
Collecting nvidia-cuda-runtime-cu12==12.3.101
  Using cached nvidia_cuda_runtime_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (867 kB)

and relevant links in the docs only seem to link out to Docker-related stuff, like https://www.tensorflow.org/install/source, so the vast majority of information on the internet is out of date.

Is there any clearer guidance for how to get TensorFlow working on GPUs assuming your CUDA install is non-standard, i.e., not installed out of the Ubuntu package repo (which is infeasible in many academic settings)?

Thanks very much in advance.

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue)
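
For anyone else who lands here, the variable just needs to be set before TensorFlow is imported, e.g.:

# Turn on TensorFlow's verbose C++ logging before the library is imported so
# the individual dlopen attempts (and the paths involved) are printed.
import os
os.environ["TF_CPP_MAX_VLOG_LEVEL"] = "3"

import tensorflow as tf  # must come after the environment variable is set
print(tf.config.list_physical_devices("GPU"))

Exporting the variable in the shell before running the script works just as well.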

@wjn0 thanks - I resolved the underlying problem in the end, and from memory I thought I had increased the log verbosity as high as it would go, but maybe I had not. If I run into library problems again I'll give it a go. Cheers!

commented

EDIT: I was able to resolve this by using TF_CPP_MAX_VLOG_LEVEL=3 (something buried in the above linked issue) to debug. It turned out that our new module system was nuking my LD_LIBRARY_PATH after cuDNN was imported, so CUDA could be found but cuDNN could not. Adding a note about this option to the error message around GPUs could potentially save a lot of grief (even in scenarios like mine where the issue lies not with TensorFlow, but something upstream). Just a thought. May help you as well @stellarpower (seems to be what you were looking for when you opened the issue)

Hi @wjn0, AFAIK the setting TF_CPP_MAX_VLOG_LEVEL=3 only affects the debugging logs printed to the console. I'm doubtful, and would like to know: did the cuDNN libraries only become detectable after you changed that logging setting? Setting the right path in LD_LIBRARY_PATH should resolve the issue irrespective of the debug logging. Correct me if I am wrong.

Thanks for the info.

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.