BaguaSys / bagua

Bagua Speeds up PyTorch

Home Page:https://tutorials-8ro.pages.dev/

cannot find libnccl.so.2

lixiangMindSpore opened this issue · comments

Describe the bug
A clear and concise description of what the bug is.
image

Environment

  • Your operating system and version: Ubuntu 18.04
  • Your python version: 3.8
  • Your PyTorch version: 11.0
  • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda create -n torch17 python=3.8
  • Have you tried using latest bagua master (python3 -m pip install git+https://github.com/BaguaSys/bagua.git -f https://repo.arrayfire.com/python/wheels/3.8.0/ )?: I use 0.8.1.post1

Reproducing

Please provide a minimal working example. This means the runnable code.

Please also write what exact commands are required to reproduce your results.

Additional context
Add any other context about the problem here.

commented

Thanks for opening the issue. In this case, Bagua cannot find an NCCL installation on your system. Have you tried following the instruction in the error message and running import bagua_core; bagua_core.install_deps() in your Python interpreter? It will help install the needed system libraries.
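To confirm whether the dynamic loader can actually see NCCL before and after installing dependencies, a check along the following lines may help. This is only a minimal sketch using the standard ctypes module: the helper name can_load is made up for illustration, and it tests the default loader search path only, so it does not reproduce whatever bagua_core does internally when it preloads libraries.

```python
import ctypes
import ctypes.util


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library.

    Hypothetical helper for illustration; not part of Bagua.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file (or one of its own dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libnccl.so.2 is the file named in the ImportError.
    print("libnccl.so.2 loadable:", can_load("libnccl.so.2"))
    # find_library searches the usual system locations and returns None if nothing is found.
    print("find_library('nccl'):", ctypes.util.find_library("nccl"))
```

If the first line prints False, the library is not on the default search path of the current environment, which matches the "cannot open shared object file" message.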

I ran bagua_install_deps.py and it solved the problem. Thank you so much!

commented

You're welcome :)

Python 3.8.0 (default, Feb 25 2021, 22:10:10) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import bagua
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-b6bb5bf6d045> in <module>
----> 1 import bagua

~/python38/lib/python3.8/site-packages/bagua/__init__.py in <module>
     10 """
     11 
---> 12 import bagua_core  # noqa: F401
     13 from .version import __version__  # noqa: F401

~/python38/lib/python3.8/site-packages/bagua_core/__init__.py in <module>
      2 
      3 _environment._preload_libraries()
----> 4 from .bagua_core import *  # noqa: F401,E402,F403
      5 from .bagua_install_deps import install_deps  # noqa: F401,E402,F403

ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

I got the same error with bagua-cuda116 in a virtualenv. Running bagua_install_deps.py failed for me.

bagua_install_deps.py 
import-im6.q16: not authorized `os' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `platform' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `shutil' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `tempfile' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `pathlib' @ error/constitute.c/WriteImage/1037.
from: too many arguments
/home/xxx/python38/bin/bagua_install_deps.py: line 10: _nccl_records: command not found
/home/xxx/python38/bin/bagua_install_deps.py: line 11: library_records: command not found
/home/xxx/python38/bin/bagua_install_deps.py: line 14: syntax error near unexpected token `('
/home/xxx/python38/bin/bagua_install_deps.py: line 14: `class DownloadProgressBar(tqdm):'

bagua-cuda116 was built differently from the other CUDA releases.

bagua-cuda116                 0.8.3.dev215

@Godricly Which python version did you use to run bagua_install_deps.py?

Maybe you can try: python3 bagua_install_deps.py?
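The "import-im6.q16" and "from: too many arguments" lines in the output above are what typically appears when a Python file is executed by the shell instead of by Python (the shell runs ImageMagick's import command for each import statement). The sketch below shows one way to check for that and to build an explicit interpreter command. The helper names are hypothetical and nothing here is taken from Bagua itself; it assumes the script is on PATH, as in the report.

```python
import shutil
import sys
from typing import List, Optional


def starts_with_python_shebang(path: str) -> bool:
    """Return True if the file's first line is a '#!' line that mentions python."""
    with open(path, "rb") as f:
        first_line = f.readline()
    return first_line.startswith(b"#!") and b"python" in first_line


def python_command(script_name: str) -> Optional[List[str]]:
    """Build a command that runs a script found on PATH with the current interpreter.

    Returns None if the script is not found on PATH.
    """
    script_path = shutil.which(script_name)
    if script_path is None:
        return None
    # Passing the script to the interpreter explicitly avoids relying on a shebang line.
    return [sys.executable, script_path]


if __name__ == "__main__":
    # Only prints the command; run it yourself (e.g. with subprocess.run) if it looks right.
    print(python_command("bagua_install_deps.py"))
```

Running the printed command is equivalent to the python3 bagua_install_deps.py suggestion, but with the full path resolved and the interpreter of the active virtualenv.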

I tried on another machine with cuda113 and NCCL, and it works well for me.
I think the problem is that NCCL is not installed. Also, the bagua-cuda116 version should be updated.