tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

libcudart.so.11.0 is loaded for Profiler but libcudart.so.10.1 for Training

jun-danieloh opened this issue

This is the part of my code that uses the Profiler. The fit call here is not the Keras API but a custom fit function that I wrote.

     # start training
     with tf.profiler.experimental.Profile('/home/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3'):
         trainer.fit(
             train_dataset,
             valid_dataset,
             saved_path=os.path.join(config["outdir"], "checkpoints/"),
             resume=args.resume)
 
     trainer.save_checkpoint()
     logging.info(f"Successfully saved checkpoint @ {trainer.steps}steps.")

This run completed successfully and produced the output files below. I believe the run is fine because the profiler/demo folder provided with the repo contains similar files.

root:tboard_v3# ls -Rl
.:
total 80
-rw-r--r-- 1 root root 40 Nov 27 14:34 events.out.tfevents.1606487681.deepbasegpu050.profile-empty
drwxr-xr-x 3 root root 25 Nov 27 14:34 plugins

./plugins:
total 32
drwxr-xr-x 3 root root 37 Nov 27 14:34 profile

./plugins/profile:
total 32
drwxr-xr-x 2 root root 342 Nov 27 14:34 2020_11_27_14_34_30

./plugins/profile/2020_11_27_14_34_30:
total 393720
-rw-r--r-- 1 root root    174892 Nov 27 14:34 deepbasegpu050.input_pipeline.pb
-rw-r--r-- 1 root root   1076916 Nov 27 14:34 deepbasegpu050.kernel_stats.pb
-rw-r--r-- 1 root root     54105 Nov 27 14:34 deepbasegpu050.memory_profile.json.gz
-rw-r--r-- 1 root root    178125 Nov 27 14:34 deepbasegpu050.overview_page.pb
-rw-r--r-- 1 root root    590702 Nov 27 14:34 deepbasegpu050.tensorflow_stats.pb
-rw-r--r-- 1 root root  18606122 Nov 27 14:34 deepbasegpu050.trace.json.gz
-rw-r--r-- 1 root root 280023266 Nov 27 14:34 deepbasegpu050.xplane.pb

My question starts here. When I run the Profiler I get the warnings below (not really errors, but I am not able to see any information in TensorBoard). The Profiler tries to load libcudart.so.11.0 and fails, whereas libcudart.so.10.1 was loaded during training. What am I missing?

Logs for Profiler:

root:profiler# python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3
created virtual environment CPython3.6.9.final.0-64 in 7785ms
  creator CPython3Posix(dest=/home/jun1.oh/daniel_workspace/profiler/profile_env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: Keras_Preprocessing==1.1.2, Markdown==3.3.3, Werkzeug==1.0.1, absl_py==0.11.0, astunparse==1.6.3, cachetools==4.1.1, certifi==2020.11.8, chardet==3.0.4, flatbuffers==1.12, gast==0.3.3, google_auth==1.23.0, google_auth_oauthlib==0.4.2, google_pasta==0.2.0, grpcio==1.32.0, gviz_api==1.9.0, h5py==2.10.0, idna==2.10, importlib_metadata==3.1.0, numpy==1.19.4, oauthlib==3.1.0, opt_einsum==3.3.0, pip==20.2.4, protobuf==3.13.0, pyasn1==0.4.8, pyasn1_modules==0.2.8, requests==2.25.0, requests_oauthlib==1.3.0, rsa==4.6, setuptools==50.3.2, six==1.15.0, tb_nightly==2.5.0a20201129, tbp_nightly==2.4.0a20201129, tensorboard_plugin_wit==1.7.0, termcolor==1.1.0, tf_estimator_nightly==2.4.0.dev2020102301, tf_nightly==2.5.0.dev20201129, typing_extensions==3.7.4.3, urllib3==1.26.2, wheel==0.35.1, wrapt==1.12.1, zipp==3.4.0
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
WARNING: Skipping tensorboard-plugin-profile as it is not installed.
WARNING: Skipping tensorboard as it is not installed.
WARNING: Skipping tensorflow-estimator as it is not installed.
WARNING: Skipping tensorflow as it is not installed.
2020-11-29 13:05:39.683942: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-29 13:05:39.684011: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
TensorBoard 2.5.0a20201129 at http://deepbasegpu050:6006/ (Press CTRL+C to quit)

Logs for Training:

...
2020-11-27 14:30:46.413751: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-27 14:30:47.593919: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
...

Thanks in advance!!

root:tboard_v3# nvidia-smi
Sun Nov 29 13:54:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
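To compare the two logs above at a glance, here is a minimal sketch that pulls the libcudart version and load status out of TensorFlow dso_loader log lines. The function name cudart_versions and the two sample strings (shortened from the logs in this post) are illustrative assumptions, not part of TensorFlow or the profiler. Note also that the "CUDA Version: 11.0" shown by nvidia-smi is the highest CUDA version the driver supports, not the toolkit version installed in the environment.

```python
import re

# Matches the CUDA runtime library name in dso_loader log lines,
# e.g. "libcudart.so.10.1" or "libcudart.so.11.0".
_CUDART = re.compile(r"libcudart\.so\.(\d+(?:\.\d+)*)")


def cudart_versions(log_text):
    """Return {version: status}, where status is 'loaded' or 'missing'."""
    result = {}
    for line in log_text.splitlines():
        match = _CUDART.search(line)
        if not match:
            continue
        if "Successfully opened" in line:
            result[match.group(1)] = "loaded"
        elif "Could not load" in line:
            result[match.group(1)] = "missing"
    return result


# Sample lines shortened from the logs above (timestamps dropped).
profiler_log = (
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcudart.so.11.0'; "
    "dlerror: libcudart.so.11.0: cannot open shared object file"
)
training_log = (
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:48] "
    "Successfully opened dynamic library libcudart.so.10.1"
)

print(cudart_versions(profiler_log))   # {'11.0': 'missing'}
print(cudart_versions(training_log))   # {'10.1': 'loaded'}
```

A mismatch like this one suggests the profiler environment installed a TensorFlow build targeting a different CUDA runtime than the one used for training.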

The line "TensorBoard 2.5.0a20201129 at http://deepbasegpu050:6006/ (Press CTRL+C to quit)" suggests TensorBoard started successfully.
Could you open the profile page at http://deepbasegpu050:6006/#profile? Is the profile tab empty?

Which TensorFlow version did you train with? If you trained with TF 2.3, could you try opening the 2.3 version of the profiler with the command below:
python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3 --version=2.3
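One way to avoid guessing the value for --version is to derive it from the version string of the TensorFlow used for training. The sketch below is only an illustration: the helper name profiler_version_flag is hypothetical, and whether install_and_run.py accepts every major.minor value it can produce is an assumption that is not confirmed in this thread.

```python
def profiler_version_flag(tf_version):
    """Turn a TensorFlow version such as '2.3.0' into '--version=2.3'."""
    parts = tf_version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"unexpected TensorFlow version: {tf_version!r}")
    return f"--version={parts[0]}.{parts[1]}"


# In the training environment the string would come from tf.__version__;
# a literal is used here so the sketch runs without TensorFlow installed.
print(profiler_version_flag("2.3.0"))  # --version=2.3
```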


@qiuminxu Thanks for your reply. The command above works, and the libcudart.so.11.0 loading issue is gone. However, I still can't see any results in TensorBoard. I pasted the full logs below. Do you have any suggestions? I don't see the "Illegal Content-Security-Policy for script-src: 'unsafe-inline'" warning when running TensorBoard without the Profiler, where everything works fine.

root:profiler# python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3 --version=2.3
created virtual environment CPython3.6.9.final.0-64 in 8485ms
  creator CPython3Posix(dest=/home/jun1.oh/daniel_workspace/profiler/profile_env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: Keras_Preprocessing==1.1.2, Markdown==3.3.3, Werkzeug==1.0.1, absl_py==0.11.0, astunparse==1.6.3, cachetools==4.1.1, certifi==2020.11.8, chardet==3.0.4, flatbuffers==1.12, gast==0.3.3, google_auth==1.23.0, google_auth_oauthlib==0.4.2, google_pasta==0.2.0, grpcio==1.32.0, gviz_api==1.9.0, h5py==2.10.0, idna==2.10, importlib_metadata==3.1.0, numpy==1.19.4, oauthlib==3.1.0, opt_einsum==3.3.0, pip==20.2.4, protobuf==3.13.0, pyasn1==0.4.8, pyasn1_modules==0.2.8, requests==2.25.0, requests_oauthlib==1.3.0, rsa==4.6, setuptools==50.3.2, six==1.15.0, tb_nightly==2.5.0a20201129, tbp_nightly==2.4.0a20201129, tensorboard_plugin_wit==1.7.0, termcolor==1.1.0, tf_estimator_nightly==2.4.0.dev2020102301, tf_nightly==2.5.0.dev20201129, typing_extensions==3.7.4.3, urllib3==1.26.2, wheel==0.35.1, wrapt==1.12.1, zipp==3.4.0
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
WARNING: Skipping tensorboard-plugin-profile as it is not installed.
WARNING: Skipping tensorboard as it is not installed.
WARNING: Skipping tensorflow-estimator as it is not installed.
WARNING: Skipping tensorflow as it is not installed.
2020-12-01 06:37:50.849754: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TensorBoard 2.3.0 at http://deepbasegpu050:6006/ (Press CTRL+C to quit)
W1201 06:38:16.229210 140247161591552 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
W1201 06:38:27.159219 140247144806144 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
W1201 06:39:51.705721 140247153198848 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'

The logs look fine. Can you take a screenshot of your TensorBoard? Also, do you see any error message in the browser console? (Right-click in Chrome, select Inspect, then go to the Console tab.)

@qiuminxu The Profile tab is empty and the main screen shows the message below. Not sure why it is not working :(

No dashboards are active for the current data set.
Probable causes:

You haven't written any data to your event files.
TensorBoard can't find your event files.
If you're new to using TensorBoard, and want to find out how to add data and set up your event files, check out the README and perhaps the TensorBoard tutorial.
If you think TensorBoard is configured properly, please see the section of the README devoted to missing data problems and consider filing an issue on GitHub.

Last reload: Dec 2, 2020, 10:55:30 AM
  • Command that I ran: root:profiler# python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3 --version=2.3

  • Files generated with Profiler:

[jun1.oh@deepbasegpu050 tboard_v3]$ ls -R
.:
events.out.tfevents.1606487681.deepbasegpu050.profile-empty  plugins

./plugins:
profile

./plugins/profile:
2020_11_27_14_34_30

./plugins/profile/2020_11_27_14_34_30:
deepbasegpu050.input_pipeline.pb       deepbasegpu050.overview_page.pb     deepbasegpu050.xplane.pb
deepbasegpu050.kernel_stats.pb         deepbasegpu050.tensorflow_stats.pb
deepbasegpu050.memory_profile.json.gz  deepbasegpu050.trace.json.gz
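The listing above follows the layout plugins/profile/RUN/FILES under the log directory. As a quick sanity check that a given --logdir contains that layout, a small script along these lines could be used. The function name find_profile_runs is hypothetical, and the demo builds a temporary directory with two empty files named after the ones in this post rather than reading the real profile data.

```python
import os
import tempfile


def find_profile_runs(logdir):
    """Map each run directory under <logdir>/plugins/profile to its file names."""
    profile_root = os.path.join(logdir, "plugins", "profile")
    runs = {}
    if not os.path.isdir(profile_root):
        return runs
    for run in sorted(os.listdir(profile_root)):
        run_dir = os.path.join(profile_root, run)
        if os.path.isdir(run_dir):
            runs[run] = sorted(os.listdir(run_dir))
    return runs


# Demo: recreate a tiny version of the layout shown above in a temp directory.
with tempfile.TemporaryDirectory() as logdir:
    run_dir = os.path.join(logdir, "plugins", "profile", "2020_11_27_14_34_30")
    os.makedirs(run_dir)
    for name in ("deepbasegpu050.xplane.pb", "deepbasegpu050.trace.json.gz"):
        open(os.path.join(run_dir, name), "wb").close()
    demo_runs = find_profile_runs(logdir)

print(demo_runs)
```

An empty result for the real log directory would point at a wrong --logdir path rather than at TensorBoard itself.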

Hmm, I'm running out of ideas; it looks like this should work. What are your system environment and browser version? Is it possible that TensorBoard doesn't have read permission for your directory?

@qiuminxu It is a Docker environment. Chrome version is 84.0.4147.89.

root:profiler# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

root:profiler# nvidia-smi
Wed Dec  2 02:11:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|

@qiuminxu This is the current permission status of the directory, so I think read permission should be fine.

drwxrwxrwx  3 root    root         102 Nov 27 23:34 tboard_v3
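The mode of the top-level directory alone does not show whether everything inside it is readable. A minimal sketch for checking the whole tree from the account that runs TensorBoard is given below; the function name unreadable_paths is hypothetical, and the demo uses a temporary directory instead of the real tboard_v3 path. When run as root, as the shell prompts in this thread suggest, read checks always pass, so an empty result is expected there.

```python
import os
import tempfile


def unreadable_paths(logdir):
    """Return paths under logdir that the current process cannot read."""
    # A directory needs read and execute (search) permission to be listed
    # and entered; os.walk silently skips directories it cannot list.
    if not os.access(logdir, os.R_OK | os.X_OK):
        return [logdir]
    blocked = []
    for root, dirs, files in os.walk(logdir):
        for name in dirs:
            path = os.path.join(root, name)
            if not os.access(path, os.R_OK | os.X_OK):
                blocked.append(path)
        for name in files:
            path = os.path.join(root, name)
            if not os.access(path, os.R_OK):
                blocked.append(path)
    return blocked


# Demo on a freshly created directory tree owned by the current user.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "plugins", "profile"))
    demo_blocked = unreadable_paths(demo_dir)

print(demo_blocked)
```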

Setting up TensorBoard inside Docker can be a bit tricky; for example, you need to make sure port 6006 is exposed outside the container. Could you try copying your directory out of Docker and see if that works?

For debugging inside Docker, you can try a different port, say 6009:
python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3 --version=2.3 --port=6009
Sometimes another TensorBoard instance is already running outside Docker on port 6006, so this is to rule that out.
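Before switching ports, it is also possible to check directly whether something is already listening on them. The sketch below uses only Python's standard socket module; the function name port_in_use and the default host 127.0.0.1 are illustrative assumptions, and the result depends on where the script is run (with --net=host the container and the host share the same ports).

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Check the two ports mentioned in this thread before starting TensorBoard.
for port in (6006, 6009):
    print(port, "in use" if port_in_use(port) else "free")
```

If 6006 is reported as in use before the profiler script is started, another TensorBoard (or some other server) is already bound to it.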

@qiuminxu This doesn't work either :( It still shows "No dashboards are active for the current data set."

This is my Docker command:
docker run --gpus all -it --privileged=true --net=host --name docker_tfboard_v3 -v /home:/home:rw docker.rt.sec.samsung.net/jun1.oh/tf2.3_lpctron_tensorboard_nov24_2020 /bin/bash

I run Docker with the --net=host option, so the container shares the host's network, and I believe the port is not the issue. TensorBoard also works fine in normal usage without the Profiler.

Just to verify, have you tried scrolling down the dropdown list of plugins? The profile plugin is at the bottom.
[Screenshot: profile_dropdown]

@qiuminxu Thank you for correcting me. I didn't realize that the Profile tab is at the bottom.

I scrolled down the dropdown menu as you suggested. TensorBoard tries to load some data but then freezes. I think there may be a network issue. It stays frozen and eventually the following log appears in the shell.

TensorBoard 2.3.0 at http://deepbasegpu050:6009/ (Press CTRL+C to quit)
W1203 06:58:29.289239 140708716222208 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'

Is the "Illegal Content-Security-Policy for script-src: 'unsafe-inline'" warning a separate issue from TensorBoard itself?

@qiuminxu Can you give me any updates on this issue?

The warning is benign. Can you right-click, select Inspect, and see if there is any error message in the JavaScript console?
Also, can you upload a screenshot of the TensorBoard?

@qiuminxu I didn't see any error messages in the console after right-clicking and selecting Inspect. However, when I did that, the following messages appeared in the terminal (the last three lines below).

root:TensorFlowTTS_to_LPCTron_Dec02# cd ../profiler; python3 install_and_run.py --envdir=profile_env --logdir=/home/jun1.oh/daniel_workspace/TensorFlowTTS_Nov16/tboard_v3 --version=2.3
created virtual environment CPython3.6.9.final.0-64 in 10178ms
  creator CPython3Posix(dest=/home/jun1.oh/daniel_workspace/profiler/profile_env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: Keras_Preprocessing==1.1.2, Markdown==3.3.3, Werkzeug==1.0.1, absl_py==0.11.0, astunparse==1.6.3, cachetools==4.1.1, certifi==2020.11.8, chardet==3.0.4, flatbuffers==1.12, gast==0.3.3, google_auth==1.23.0, google_auth_oauthlib==0.4.2, google_pasta==0.2.0, grpcio==1.32.0, gviz_api==1.9.0, h5py==2.10.0, idna==2.10, importlib_metadata==3.1.0, numpy==1.18.5, oauthlib==3.1.0, opt_einsum==3.3.0, pip==20.2.4, protobuf==3.13.0, pyasn1==0.4.8, pyasn1_modules==0.2.8, requests==2.25.0, requests_oauthlib==1.3.0, rsa==4.6, scipy==1.4.1, setuptools==50.3.2, six==1.15.0, tensorboard==2.3.0, tensorboard_plugin_profile==2.3.0, tensorboard_plugin_wit==1.7.0, tensorflow==2.3.0, tensorflow_estimator==2.3.0, termcolor==1.1.0, typing_extensions==3.7.4.3, urllib3==1.26.2, wheel==0.35.1, wrapt==1.12.1, zipp==3.4.0
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
WARNING: Skipping tbp-nightly as it is not installed.
WARNING: Skipping tb-nightly as it is not installed.
WARNING: Skipping tf-estimator-nightly as it is not installed.
WARNING: Skipping tf-nightly as it is not installed.
WARNING: You are using pip version 20.2.4; however, version 20.3.1 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profiler/profile_env/bin/python -m pip install --upgrade pip' command.
WARNING: You are using pip version 20.2.4; however, version 20.3.1 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profiler/profile_env/bin/python -m pip install --upgrade pip' command.
WARNING: You are using pip version 20.2.4; however, version 20.3.1 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profiler/profile_env/bin/python -m pip install --upgrade pip' command.
2020-12-15 11:11:07.262480: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TensorBoard 2.3.0 at http://deepbasegpu050:6006/ (Press CTRL+C to quit)
W1215 11:12:06.562613 140515802597120 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
W1215 11:12:52.415842 140515802597120 application.py:557] path /styles.css.map not found, sending 404
W1215 11:12:52.418248 140515693557504 application.py:557] path /tbdev_upload_button_component.css.map not found, sending 404
W1215 11:12:52.419265 140515819382528 application.py:557] path /plugin_selector_component.css.map not found, sending 404 

I attached a screenshot of the TensorBoard as well.
[Screenshot: capture1]

Hmm, the three messages look related to other plugins, not the profile plugin.
Can you try python3 profiler/install_and_run.py --envdir=profile_env --logdir=profiler/demo and see if TensorBoard works with the demo trace?

@qiuminxu Yes. I am not an expert in these plugin issues, so this is very hard to debug. Should I instead try the Profiler in a different server environment?

This is the log from running python3 profiler/install_and_run.py --envdir=profile_env --logdir=profiler/demo --version=2.3. The logs look exactly the same.

root:daniel_workspace# python3 profiler/install_and_run.py --envdir=profile_env --logdir=profiler/demo --version=2.3
created virtual environment CPython3.6.9.final.0-64 in 3599ms
  creator CPython3Posix(dest=/home/jun1.oh/daniel_workspace/profile_env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==20.2.4, setuptools==50.3.2, wheel==0.35.1
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
WARNING: Skipping tensorboard-plugin-profile as it is not installed.
WARNING: Skipping tensorboard as it is not installed.
WARNING: Skipping tensorflow-estimator as it is not installed.
WARNING: Skipping tensorflow as it is not installed.
WARNING: Skipping tbp-nightly as it is not installed.
WARNING: Skipping tb-nightly as it is not installed.
WARNING: Skipping tf-estimator-nightly as it is not installed.
WARNING: Skipping tf-nightly as it is not installed.
WARNING: You are using pip version 20.2.4; however, version 20.3.3 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profile_env/bin/python -m pip install --upgrade pip' command.
WARNING: You are using pip version 20.2.4; however, version 20.3.3 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profile_env/bin/python -m pip install --upgrade pip' command.
WARNING: You are using pip version 20.2.4; however, version 20.3.3 is available.
You should consider upgrading via the '/home/jun1.oh/daniel_workspace/profile_env/bin/python -m pip install --upgrade pip' command.
2020-12-16 02:15:53.871046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TensorBoard 2.3.0 at http://deepbasegpu050:6006/ (Press CTRL+C to quit)
W1216 02:18:18.550723 140379008415488 security_validator.py:51] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
W1216 02:18:29.976606 140380057544448 application.py:557] path /styles.css.map not found, sending 404
W1216 02:18:29.978160 140379008415488 application.py:557] path /tbdev_upload_button_component.css.map not found, sending 404
W1216 02:18:29.978945 140379033593600 application.py:557] path /plugin_selector_component.css.map not found, sending 404

Yeah, I think this is an environment issue. Can you copy the logs to your local laptop or desktop and try again?

@qiuminxu Thanks :) Let me try it again in a different environment.

@jun-danieloh, did you get the issue fixed?