NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Install Failure on GCP Deep Learning VM

glenn-jocher opened this issue:

I created a simple GCP Deep Learning VM:
https://cloud.google.com/deep-learning-vm/

I followed the install directions, and the install failed with errors:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j0qgf5ds/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, 
__file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-1yr2fag5/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-j0qgf5ds/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

The Python-only option also failed:

pip install -v --no-cache-dir .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-eedemek6/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, 
__file__, 'exec'))" install --record /tmp/pip-record-ehl5a4y7/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-eedemek6/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

I would have expected installation on a GCP Deep Learning VM to be one of the tested use cases here, no? If it doesn't work there, of all places, where is it intended to work?

I'm not sure if this issue is specific to apex. I think you need to make sure your instance has python-dev:
google/python-subprocess32#38
See also
https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894
(scroll down to "Installing Tensorflow," which is not directly relevant, but does also say to sudo apt-get install python3-pip python3-dev).

Also, I don't think this issue is related to cpp extension building in particular. I think if the suggested fix resolves your issue for the Python-only build, the cpp and cuda extension build is definitely worth another try.

@mcarilli ah, thanks for the reply! I tried what you said, but they seem to be already installed. For completeness I included all the header information from the VM when it starts up below. These VMs come with PyTorch (and almost everything else) preinstalled. We use them in our GCP Quickstart Guide on our YOLOv3 repo:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart

Version: m23
Based on: Debian GNU/Linux 9.8 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64\n)
Resources:
 * Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
 * Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
 * Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh

This image uses python 3.7 from the Anaconda. Anaconda is installed to:
/opt/anaconda3/

Linux instance-2 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

ultralytics@instance-2:~$ sudo apt-get install python3-pip python3-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pip is already the newest version (9.0.1-2).
python3-dev is already the newest version (3.5.3-1).
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.

Following https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894, maybe the solution is as simple as using pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . instead of pip install ...

Before February, Apex served primarily as a research/internal toolkit. The growth in popularity has been a (pleasant) surprise. I only recently made it my full-time project, and I haven't actually tested the install on Google Cloud before, so this is valuable information.

@mcarilli thanks, the change worked. The line I used to successfully install is:

pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

Unfortunately, after the install the apex module can be found, but not amp:

...
  running install_egg_info
    running egg_info
    creating apex.egg-info
    writing apex.egg-info/PKG-INFO
    writing top-level names to apex.egg-info/top_level.txt
    writing dependency_links to apex.egg-info/dependency_links.txt
    writing manifest file 'apex.egg-info/SOURCES.txt'
    reading manifest file 'apex.egg-info/SOURCES.txt'
    writing manifest file 'apex.egg-info/SOURCES.txt'
    Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info
    running install_scripts
    writing list of installed files to '/tmp/pip-ln69wwvt-record/install-record.txt'
done
  Removing source in /tmp/pip-5vfngf45-build
Successfully installed apex-0.1
Cleaning up...

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python3 -c "import apex"
ultralytics@instance-2:~$ python3 -c "from apex import amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'amp' from 'apex' (unknown location)
ultralytics@instance-2:~$ python3 -c "import apex; a=apex.amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute 'amp'

This may be an artifact of where you ran import apex (one level above the apex repo directory). I think when you run import apex from ~, Python picks up the cloned repo directory called apex rather than the installed package, which is not what you want.

Try this, starting in the apex repo directory:

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python
...
>>> import apex
>>> import sys
>>> sys.modules['apex']

This should show where the files are being imported from, which should be some system install path. For example, on my system:

>>> sys.modules['apex']
<module 'apex' from '/home/mcarilli/anaconda3/lib/python3.6/site-packages/apex/__init__.py'>
>>>
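
A minimal sketch of the same check that does not import the package, using only the standard library: importlib.util.find_spec reports where Python would load a name from. A spec with no origin file usually means a bare directory (such as a cloned repo folder in the current directory) is being treated as a namespace package, which would be consistent with the "(unknown location)" message above. The helper name describe_import is hypothetical.

```python
import importlib.util


def describe_import(name):
    """Return a short description of where `name` would be imported from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found on sys.path"
    if spec.origin in (None, "namespace"):
        # No __init__.py was found: a plain directory is being treated as a
        # namespace package (for example, a cloned repo folder in the cwd).
        locations = list(spec.submodule_search_locations or [])
        return f"{name}: namespace package at {locations}"
    return f"{name}: loaded from {spec.origin}"


if __name__ == "__main__":
    print(describe_import("apex"))
```

Running this once from ~ and once from another directory should make it clear whether the cloned folder or the installed package is being found.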

After installing, you can also try running the L0 tests:

cd tests/L0
python run_test.py

They should all pass if you installed with cpp/cuda extensions.

@mcarilli ah yes, you are right! It was importing from the cloned repo. After I removed the ~/apex repo directory, import apex no longer worked at all.

I'm starting to think this is a conda install issue (the GCP Deep Learning VMs use Anaconda with Python 3.7). Following these directions on installing non-conda packages, I activated the conda environment before trying the install. The install was successful, but the package is then missing from conda list and the import fails. I think I somehow need to direct it to install to /opt/anaconda3, because the install output instead mentions a separate Python 3.5: Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info

ultralytics@instance-2:~$ conda info --envs
WARNING: The conda.compat module is deprecated and will be removed in a future release.
# conda environments:
#
base                  *  /opt/anaconda3
ultralytics@instance-2:~$ source activate base
(base) ultralytics@instance-2:~$ git clone https://github.com/NVIDIA/apex
(base) ultralytics@instance-2:~$ cd apex
(base) ultralytics@instance-2:~/apex$ pip3 install -v --no-cache-dir .
...
Successfully installed apex-0.1
Cleaning up...
(base) ultralytics@instance-2:~/apex$ cd .. && rm -rf apex
(base) ultralytics@instance-2:~$ python3
Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from apex import amp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'apex'
>>> 

Hmm, if I try this on my local machine, it appears to install to the correct location. I'm not sure what's different/lacking about the conda environment on the GCP instance...

apex_fresh$ source activate base
(base) apex_fresh$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
Copying apex.egg-info to /home/mcarilli/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6.egg-info
...
(base) apex_fresh$ cd ../..
(base) Desktop$ python
Python 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
>>> from apex import amp
>>> 

Did you ever figure out why conda on GCP was installing to the wrong directory? I'm not a conda expert so if you managed to resolve this issue it will be helpful for future users.

No, no luck. I created a blank PyTorch Deep Learning VM and tried again from scratch, but it still installs to a separate Python 3.5 rather than Anaconda's Python (see the output line below). It seems to be an Anaconda issue, and unfortunately I'm not a conda expert either. pip installs into conda environments are not always problem-free; I've seen other repos with conda-specific install instructions.
creating /home/ultralytics/.local/lib/python3.5/site-packages/apex/amp

In your example above, you see apex in your conda list, right?

Yes:

(base) apex_fresh$ conda list | grep apex
apex                      0.1                       <pip>

When I've had issues using pip installs in conda environments in the past, I've sometimes resolved them by explicitly running conda install pip within the conda environment before doing pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . within the same environment.

Is it possible to use a Docker container on the gcp instance as a potential workaround? There are several options for Docker containers in which we test the Apex install regularly: https://github.com/NVIDIA/apex/tree/master/examples/docker

Even if Docker containers succeed, this does not lessen the importance of having the bare-metal Apex install work as well. I'll consult some people who have more experience with conda.

My guess is that it's installing to Python 3.5 because it's using the OS's pip3 (Python 3.5) rather than conda's Python 3.7. You can confirm by running pip3 --version and python --version.

@ngimel yes, you are correct! pip itself directs correctly to anaconda3/lib/python3.7, but pip3 is directing to a local python3.5.

glenn@instance-1:~$ pip3 --version
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.5)
glenn@instance-1:~$ python --version
Python 3.7.3
glenn@instance-1:~$ pip --version
pip 19.0.3 from /opt/anaconda3/lib/python3.7/site-packages/pip (python 3.7)

@mcarilli so I understand the situation now:

  • pip attempts to install to the correct location (conda's Python 3.7), but the install fails.
  • pip3 installs to an incorrect location (the OS Python 3.5), so the package is inaccessible from the conda env.
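
The mismatch summarized above comes from pip and pip3 belonging to different interpreters. A common way to sidestep it is to invoke pip as a module of the interpreter you intend to use (python -m pip). The sketch below is only illustrative: the helper name pip_command is hypothetical, and the script simply prints the pip that belongs to the running Python next to whatever pip and pip3 resolve to on PATH.

```python
import shutil
import subprocess
import sys


def pip_command(*args):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    # Executables found first on PATH; these may belong to another Python.
    for tool in ("pip", "pip3"):
        print(f"{tool} on PATH:", shutil.which(tool))
    # The pip that installs into this interpreter's site-packages.
    result = subprocess.run(
        pip_command("--version"), capture_output=True, text=True
    )
    print(result.stdout.strip() or result.stderr.strip())
```

With the conda environment active, python -m pip install -v --no-cache-dir . should then target the same interpreter that python starts, regardless of which pip3 happens to be first on PATH.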

Yes, if you could get someone to spin up a PyTorch 1.1 VM in GCP and work through the apex install, that would help tremendously. Docker might be a fallback, but I think it might also be a bridge too far for many users.

I can't repro on the latest pytorch vm (Pytorch 1.1 + fastai 1.0 (CUDA 10.0))

(base) root@tensorflow-1-vm:~/apex# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
...
...
    writing list of installed files to '/tmp/pip-record-b_l2sreu/install-record.txt'
done
  Removing source in /tmp/pip-req-build-koc3g9j3
Successfully installed apex-0.1
Cleaning up...

@ngimel I just checked on a new PyTorch 1.1 vm. This time I got a permission denied error:
error: could not create '/opt/anaconda3/lib/python3.7/site-packages/apex': Permission denied

So I tried sudo pip install -v --no-cache-dir . which installs without error, but into the wrong Python (2.7). So I still cannot install apex into Anaconda's Python 3.7:
Copying apex.egg-info to /usr/local/lib/python2.7/dist-packages/apex-0.1-py2.7.egg-info

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
sudo pip install -v --no-cache-dir .

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
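
For context on why --user can help here: it directs pip to the per-user site-packages directory of the interpreter that pip belongs to, which does not require write access to a system-owned location such as /opt/anaconda3. A small sketch using only the standard site module shows where that directory is and whether the interpreter will search it (the function name user_site_info is hypothetical):

```python
import site
import sys


def user_site_info():
    """Report the per-user site-packages directory for this interpreter."""
    user_site = site.getusersitepackages()
    return {
        "user_site": user_site,
        # False or None means user site-packages are disabled (e.g. python -s).
        "enabled": bool(site.ENABLE_USER_SITE),
        # The directory is only added to sys.path if it exists at startup.
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_info().items():
        print(f"{key}: {value}")
```

Each Python version has its own user site directory, so a --user install made with one interpreter's pip is not expected to be visible from another interpreter.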

@see-- this works! I was able to successfully install on a GCP VM with the following commands:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir . --user

UPDATE 1: On running a mixed precision model with the above install I get the following warning: Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.

Installing instead with the following line removed the warning:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
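
A quick way to check whether the compiled extensions ended up in the environment is to look for the extension module without importing it. This is a rough sketch under a stated assumption: the module name amp_C used below is assumed to be the compiled CUDA module that the warning refers to, and it is not confirmed anywhere in this thread. The helper name extension_available is hypothetical.

```python
import importlib.util


def extension_available(module_name):
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "amp_C" is an assumed name for apex's compiled CUDA module; adjust it
    # if your build exposes the extension under a different name.
    for name in ("apex", "amp_C"):
        print(name, "found" if extension_available(name) else "missing")
```

If apex is found but the extension module is missing, that would be consistent with a Python-only install and the fallback warning quoted above.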

Excellent, thanks guys. Sorry I haven't had time to do a deep dive myself, but I'm pinning this issue for others.

For posterity, I was only able to get this to work (after trying many other things) with:

sudo pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

(note the sudo.)

I had to use conda-forge to get this working within my conda environment:

conda install -c conda-forge nvidia-apex

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

@see-- @glenn-jocher @sleepinyourhat
If we install with the command above, can we import apex in both Python 2.x and Python 3.x? I'm stuck here. My project uses Python 3.x and I am unable to install with pip3. Before that, I tried pip without --user and was able to import apex in Python 2.x, but I need it in Python 3.x.
So if I install with pip and --user, will it install for all Python versions?

My environment
ubuntu 16.04
CUDA Version 10.0.130
CuDNN 7.4.1
torch.version '1.3.1'
Python 3.5.2