tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

libcupti.so.10.1 not found, but the path to its folder has been exported to LD_LIBRARY_PATH

rbavery opened this issue

Setup

I'm running the profiler from the README instructions with

python3 ~/profiler/install_and_run.py --envdir=~/profile_env --logdir=~/profiler/demo

I have a TFServing app running with two ports open, one for POST requests and one for TensorBoard:

docker run \
    -p 8501:8501 \
    -p 8500:8500 --gpus all -it devseeddeploy/aiaia_fastrcnn:v1.2_wildlife-gpu

And a script to test the API.

Problem

Whenever I test the API endpoint, similar to this example, I get the following error in the logs of my TFServing Docker container:

2021-01-13 01:40:30.622358: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-13 01:40:30.622523: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

The tensorflow profiler is still triggered to save out profile information, but the information seems to be empty because libcupti.so.10.1 isn't found in /usr/local/nvidia/lib or /usr/local/nvidia/lib64, which don't exist.
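One way to see why the load fails is to walk the search path reported in the warning and check each directory. The sketch below is only illustrative: `scan_search_path` is a hypothetical helper, not part of TensorFlow, and it inspects only the directories on the given path string, whereas the real dynamic loader also consults the ldconfig cache and the default system directories.

```python
import os


def scan_search_path(search_path, prefix="libcupti.so"):
    """Report, for each directory on a colon-separated path, whether it
    exists and which files in it start with `prefix`.

    Hypothetical helper for illustration; the real loader also checks the
    ldconfig cache and default system directories.
    """
    report = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing colon
        if os.path.isdir(directory):
            matches = sorted(
                name for name in os.listdir(directory) if name.startswith(prefix)
            )
            report.append((directory, True, matches))
        else:
            report.append((directory, False, []))
    return report


if __name__ == "__main__":
    # Run this in the same environment as the process that fails to load CUPTI.
    for directory, exists, matches in scan_search_path(
        os.environ.get("LD_LIBRARY_PATH", "")
    ):
        print(directory, "exists" if exists else "MISSING", matches)
```

Running it in the environment of the failing process should list a directory as MISSING when it does not exist there, which is the situation described above for /usr/local/nvidia/lib and /usr/local/nvidia/lib64.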

Info

I'm not sure why it's looking for libcupti.so.10.1 in these paths, since this is what my LD_LIBRARY_PATH variable contains:

echo $LD_LIBRARY_PATH
/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64

I followed #209 to symlink the /usr/local/cuda folder to the 10.1 version, since I have multiple CUDA versions installed on my machine, but this doesn't seem to solve the problem.

This is the output of ldconfig

→ /sbin/ldconfig -N -v $(sed 's/:/ /g' <<< $LD_LIBRARY_PATH) | grep libcupti
/sbin/ldconfig.real: Path `/usr/local/cuda-10.1/targets/x86_64-linux/lib' given more than once
/sbin/ldconfig.real: Can't stat /usr/local/lib/i386-linux-gnu: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/i686-linux-gnu: No such file or directory
/sbin/ldconfig.real: Can't stat /lib/i686-linux-gnu: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/lib/i686-linux-gnu: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: /lib/i386-linux-gnu/ld-2.27.so is the dynamic linker, ignoring

	libcupti.so.10.1 -> libcupti.so.10.1.59
	libcupti.so.11.0 -> libcupti.so.2020.1.1
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.27.so is the dynamic linker, ignoring

	libcupti.so.9.1 -> libcupti.so.9.1.85
/sbin/ldconfig.real: /lib32/ld-2.27.so is the dynamic linker, ignoring

The CUDA versions I have on my system:

→ ls -l /usr/local 
total 56
drwxr-xr-x  3 root root 4096 Sep 11  2019 bda
drwxr-xr-x  2 root root 4096 Jan 11 10:29 bin
lrwxrwxrwx  1 root root   20 Jan 12 17:07 cuda -> /usr/local/cuda-10.1
drwxr-xr-x 15 root root 4096 Jan 12 14:36 cuda-10.1
drwxr-xr-x  3 root root 4096 Dec  8 11:45 cuda-10.2
drwxr-xr-x 15 root root 4096 Dec  8 13:48 cuda-11.0

And the full log from my TFServing container:

# rave at rave-desktop in ~ [17:29:35]
→     docker run \
    -p 8501:8501 \
    -p 8500:8500 --gpus all -it devseeddeploy/aiaia_fastrcnn:v1.2_wildlife-gpu
2021-01-13 01:32:01.958436: I tensorflow_serving/model_servers/server.cc:87] Building single TensorFlow model file config:  model_name: wildlife model_base_path: /models/wildlife
2021-01-13 01:32:01.959219: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2021-01-13 01:32:01.959232: I tensorflow_serving/model_servers/server_core.cc:575]  (Re-)adding model: wildlife
2021-01-13 01:32:02.060684: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: wildlife version: 1}
2021-01-13 01:32:02.060758: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: wildlife version: 1}
2021-01-13 01:32:02.060788: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: wildlife version: 1}
2021-01-13 01:32:02.060891: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/wildlife/001
2021-01-13 01:32:02.157609: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-01-13 01:32:02.157652: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Reading SavedModel debug info (if present) from: /models/wildlife/001
2021-01-13 01:32:02.157968: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-13 01:32:02.159254: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-01-13 01:32:02.167328: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:02.167791: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.683GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
2021-01-13 01:32:02.167801: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2021-01-13 01:32:02.167832: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:02.168179: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:02.168457: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-01-13 01:32:03.024682: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-13 01:32:03.024706: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-01-13 01:32:03.024711: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-01-13 01:32:03.024799: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:03.025213: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:03.025546: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-13 01:32:03.025847: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9344 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-01-13 01:32:03.242248: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Restoring SavedModel bundle.
2021-01-13 01:32:03.872915: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:303] SavedModel load for tags { serve }; Status: success: OK. Took 1812031 microseconds.
2021-01-13 01:32:03.886363: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/wildlife/001/assets.extra/tf_serving_warmup_requests
2021-01-13 01:32:03.906395: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: wildlife version: 1}
2021-01-13 01:32:03.909437: I tensorflow_serving/model_servers/server.cc:367] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2021-01-13 01:32:03.911055: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
2021-01-13 01:40:30.621179: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:40:30.621855: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2021-01-13 01:40:30.622358: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-13 01:40:30.622523: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-13 01:40:30.622551: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:40:31.743952: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:40:31.749805: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:40:31.749892: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:40:32.875007: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:40:32.879975: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:40:32.880057: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:40:34.008887: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:40:34.013748: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:40:34.013830: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:40:35.144981: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:41:09.295007: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:41:09.295041: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:41:10.408071: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:41:10.409241: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2021-01-13 01:41:10.409259: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-01-13 01:41:11.533950: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-01-13 01:41:11.544502: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: ~/profiler/demo/plugins/profile/plugins/profile/2021_01_12_17_41_09
2021-01-13 01:41:11.549438: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to ~/profiler/demo/plugins/profile/plugins/profile/2021_01_12_17_41_09/localhost:8500.trace.json.gz
2021-01-13 01:41:11.584176: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: ~/profiler/demo/plugins/profile/plugins/profile/2021_01_12_17_41_09
2021-01-13 01:41:11.592587: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to ~/profiler/demo/plugins/profile/plugins/profile/2021_01_12_17_41_09/localhost:8500.memory_profile.json.gz

Another piece of info that may or may not be important is that when I start the profiler, it loads libcudart.so.11.0 instead of libcudart.so.10.1.

2021-01-13 09:41:01.277856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
TensorBoard 2.5.0a20210113 at http://rave-desktop:6006/ (Press CTRL+C to quit)

When you run "echo $LD_LIBRARY_PATH", are you doing it inside Docker or outside? The LD_LIBRARY_PATH in the log is different from the one you posted, so make sure that you run "ldconfig -p | grep libcupti" INSIDE Docker.

If you add a symbolic link, it should be done in the Dockerfile too.
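The mismatch can be read straight from the log. As a rough sketch, the snippet below pulls the LD_LIBRARY_PATH field out of one of the dso_loader warnings quoted in this thread and compares it with the value echoed in the original post. The helper name `ld_library_path_from_log` is made up for illustration, and the regular expression assumes the field is the last thing on the line, as it is in these logs.

```python
import re

# Matches the trailing "LD_LIBRARY_PATH: ..." field of a dso_loader warning.
_LD_PATH_FIELD = re.compile(r"LD_LIBRARY_PATH:\s*(\S+)")


def ld_library_path_from_log(line):
    """Return the list of directories the failing process reported, or []."""
    match = _LD_PATH_FIELD.search(line)
    return match.group(1).split(":") if match else []


# Warning text copied from the container log in this thread.
log_line = (
    "Could not load dynamic library 'libcupti.so.10.1'; dlerror: "
    "libcupti.so.10.1: cannot open shared object file: No such file or "
    "directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64"
)
# Value printed by "echo $LD_LIBRARY_PATH" in the original post.
echoed_value = "/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64"

in_log = ld_library_path_from_log(log_line)
print(in_log)                             # directories seen by the server process
print(in_log == echoed_value.split(":"))  # False: the two environments differ
```

The two values do not match, which is consistent with the echo having been run in a different environment from the one the model server runs in.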

Thanks for the feedback, I'll try this out this week or early next and report back

I've started my TFserving Docker container and am trying to see what's wrong within it. It looks like this is related to tensorflow/serving#1718

It looks like CUDA 10.1 is properly detected, judging by nvcc:

root@389958af6602:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

But the LD_LIBRARY_PATH is incorrect: the nvidia folder doesn't exist in the container. What should this be set to?

root@389958af6602:~# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@389958af6602:~# ls /usr/local/
bin  cuda  cuda-10.1  etc  games  include  lib  man  sbin  share  src

I tried exporting LD_LIBRARY_PATH like so

root@389958af6602:~# export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64

since this folder contains libcupti.so.10.1

root@389958af6602:~# ls /usr/local/cuda/extras/CUPTI/lib64
libcupti.so           libcupti_static.a        libnvperf_target.so
libcupti.so.10.1      libnvperf_host.so
libcupti.so.10.1.208  libnvperf_host_static.a

but when I capture a profile I get a new error

2021-02-19 00:00:44.337578: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
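One thing worth ruling out, as an assumption about the setup rather than a confirmed cause: a variable exported in an interactive shell only affects processes started from that shell afterwards, so a model server that was already running keeps the environment it was launched with. The sketch below uses Python's standard `ctypes.CDLL` to ask the dynamic loader for the two names from the warnings. `try_load` is a hypothetical helper, and the result only describes the environment the script itself is started in, since the loader reads LD_LIBRARY_PATH at process start.

```python
import ctypes


def try_load(library_name):
    """Attempt to dlopen a shared library; return (loaded, error_message)."""
    try:
        ctypes.CDLL(library_name)
    except OSError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # Library names taken from the dso_loader warnings in this thread.
    for name in ("libcupti.so.10.1", "libcupti.so"):
        print(name, try_load(name))
```

If this succeeds in a shell where the variable is exported but the server still logs the CUPTI error, the server process is probably not seeing the same environment.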

Any updates on how to get the profiler to work with TF Serving, @yisitu? Happy to provide more debugging info this week if it helps.

If there is interest on the part of the maintainers in getting the profiler to work with tensorflow serving, I'm available to help debug and try things out.

I am guessing that the required version of CUPTI is not there. Could you describe your environment?

  • Where is the Docker image from?
  • Which version of TensorFlow is in the Docker image?
    • While attempting to make the host CUPTI available to the container, does the CUPTI version match the TensorFlow version? https://www.tensorflow.org/install/source#gpu
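As a rough illustration of the version check, the sketch below extracts the major.minor version from libcupti sonames and tests whether the one requested in the log (libcupti.so.10.1) is present. The helper names are hypothetical, the file names are copied from the container listing earlier in this thread, and the check says nothing about whether the library is actually on the loader's search path.

```python
import re

# Sonames look like libcupti.so.<major>.<minor>; fully versioned files such as
# libcupti.so.10.1.208 are deliberately not matched.
_SONAME = re.compile(r"^libcupti\.so\.(\d+)\.(\d+)$")


def cupti_sonames(filenames):
    """Return sorted (major, minor) pairs for names like libcupti.so.10.1."""
    versions = set()
    for name in filenames:
        match = _SONAME.match(name)
        if match:
            versions.add((int(match.group(1)), int(match.group(2))))
    return sorted(versions)


def has_cupti_for(filenames, wanted):
    """True if a libcupti soname matching `wanted`, e.g. "10.1", is present."""
    major, minor = (int(part) for part in wanted.split("."))
    return (major, minor) in cupti_sonames(filenames)


# File names copied from the container listing in this thread.
container_files = [
    "libcupti.so", "libcupti.so.10.1", "libcupti.so.10.1.208",
    "libcupti_static.a", "libnvperf_host.so",
]
print(cupti_sonames(container_files))
print(has_cupti_for(container_files, "10.1"))
```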

Also I am wondering if the docker image should be shipped with the required GPU libraries, if it is meant to be used with GPUs. Seems like a good idea to reduce setup burden.