Install Failure on GCP Deep Learning VM

Question

glenn-jocher opened this issue 5 years ago

I created a simple GCP Deep Learning VM:
https://cloud.google.com/deep-learning-vm/

I followed the install directions, and the install failed with errors:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j0qgf5ds/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-1yr2fag5/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-j0qgf5ds/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

The Python-only option also failed:

pip install -v --no-cache-dir .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-eedemek6/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-ehl5a4y7/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-eedemek6/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

I would have expected installation on a GCP Deep Learning VM to be one of the tested use cases here. If it doesn't work there of all places, where is it intended to work?

Michael Carilli · Answer 1 · Thu Apr 18 2019 05:26:03 GMT+0800 (China Standard Time)

I'm not sure if this issue is specific to apex. I think you need to make sure your instance has python-dev:
google/python-subprocess32#38
See also
https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894
(scroll down to "Installing Tensorflow," which is not directly relevant, but does also say to sudo apt-get install python3-pip python3-dev).

Also, I don't think this issue is related to cpp extension building in particular. I think if the suggested fix resolves your issue for the Python-only build, the cpp and cuda extension build is definitely worth another try.

Glenn Jocher · Answer 2 · Thu Apr 18 2019 06:50:58 GMT+0800 (China Standard Time)

@mcarilli ah, thanks for the reply! I tried what you said, but they seem to be already installed. For completeness I included all the header information from the VM when it starts up below. These VMs come with PyTorch (and almost everything else) preinstalled. We use them in our GCP Quickstart Guide on our YOLOv3 repo:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart

Version: m23
Based on: Debian GNU/Linux 9.8 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64\n)
Resources:
 * Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
 * Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
 * Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh

This image uses python 3.7 from the Anaconda. Anaconda is installed to:
/opt/anaconda3/

Linux instance-2 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

ultralytics@instance-2:~$ sudo apt-get install python3-pip python3-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pip is already the newest version (9.0.1-2).
python3-dev is already the newest version (3.5.3-1).
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.

Michael Carilli · Answer 3 · Thu Apr 18 2019 07:23:16 GMT+0800 (China Standard Time)

Following https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894, maybe the solution is as simple as using pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . instead of pip install ... Before February, Apex served primarily as a research/internal toolkit. The growth in popularity has been a (pleasant) surprise. I only recently made it my full-time project, and I haven't actually tested the install on Google Cloud before, so this is valuable information.

Glenn Jocher · Answer 4 · Thu Apr 18 2019 22:58:11 GMT+0800 (China Standard Time)

@mcarilli thanks, the change worked. The line I used to successfully install is:

pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

Unfortunately after install the apex module can be found, but not amp:

...
  running install_egg_info
    running egg_info
    creating apex.egg-info
    writing apex.egg-info/PKG-INFO
    writing top-level names to apex.egg-info/top_level.txt
    writing dependency_links to apex.egg-info/dependency_links.txt
    writing manifest file 'apex.egg-info/SOURCES.txt'
    reading manifest file 'apex.egg-info/SOURCES.txt'
    writing manifest file 'apex.egg-info/SOURCES.txt'
    Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info
    running install_scripts
    writing list of installed files to '/tmp/pip-ln69wwvt-record/install-record.txt'
done
  Removing source in /tmp/pip-5vfngf45-build
Successfully installed apex-0.1
Cleaning up...

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python3 -c "import apex"
ultralytics@instance-2:~$ python3 -c "from apex import amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'amp' from 'apex' (unknown location)
ultralytics@instance-2:~$ python3 -c "import apex; a=apex.amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute 'amp'

Michael Carilli · Answer 5 · Thu Apr 18 2019 23:11:52 GMT+0800 (China Standard Time)

This may be an artifact of where you tried to run import apex (from one level above the apex repo directory). I think when you say import apex from ~, Python is attempting to import the cloned repo directory called apex which is obviously not the right thing.

Try this, starting in the apex repo directory:

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python
...
>>> import apex
>>> import sys
>>> sys.modules['apex']

This should show where the files are being imported from, which should be some system install path, e.g. on my system:

>>> sys.modules['apex']
<module 'apex' from '/home/mcarilli/anaconda3/lib/python3.6/site-packages/apex/__init__.py'>
>>>

After installing, you can also try running the L0 tests:

cd tests/L0
python run_test.py

They should all pass if you installed with cpp/cuda extensions.
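
The same check can be scripted. The following is a minimal sketch, not something from the apex repo: describe_module is a hypothetical helper name and it uses only the standard library. It reports where a top-level name would be imported from and flags the case where Python has picked up a plain directory (such as a repo checkout with no __init__.py at its top level) as a namespace package, which would be consistent with the "(unknown location)" message above.

```python
import importlib.util


def describe_module(name):
    """Report where top-level module `name` would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "namespace": False, "origin": None, "locations": []}
    locations = list(spec.submodule_search_locations or [])
    # A directory without __init__.py is treated as a namespace package.
    # Its spec has no real file origin (None, or "namespace" on older Pythons).
    is_namespace = spec.origin in (None, "namespace") and bool(locations)
    return {
        "found": True,
        "namespace": is_namespace,
        "origin": spec.origin,
        "locations": locations,
    }


if __name__ == "__main__":
    print(describe_module("apex"))
```

Call it from an interactive session started in the directory where the import misbehaves. If namespace is True and locations points into the working directory, the checkout is probably shadowing (or standing in for) the installed package.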

Glenn Jocher · Answer 6 · Thu Apr 18 2019 23:41:00 GMT+0800 (China Standard Time)

@mcarilli ah yes you are right! It was importing from the cloned repo. After I removed the cloned apex repo, it would no longer import apex.

I'm starting to think this is a conda install issue (the GCP Deep Learning VMs use Anaconda with Python 3.7). Following these directions on installing non-conda packages, I activated the conda environment before trying the install. The install was successful, but the package is missing from conda list, and the import fails. I think I somehow need to direct it to install to /opt/anaconda3, because the install output instead mentions a separate Python 3.5: Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info

ultralytics@instance-2:~$ conda info --envs
WARNING: The conda.compat module is deprecated and will be removed in a future release.
# conda environments:
#
base                  *  /opt/anaconda3
ultralytics@instance-2:~$ source activate base
(base) ultralytics@instance-2:~$ git clone https://github.com/NVIDIA/apex
(base) ultralytics@instance-2:~$ cd apex
(base) ultralytics@instance-2:~/apex$ pip3 install -v --no-cache-dir .
...
Successfully installed apex-0.1
Cleaning up...
(base) ultralytics@instance-2:~/apex$ cd .. && rm -rf apex
(base) ultralytics@instance-2:~$ python3
Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from apex import amp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'apex'
>>>

Michael Carilli · Answer 7 · Fri Apr 19 2019 06:16:03 GMT+0800 (China Standard Time)

Hmm, if I try this on my local machine, it appears to install to the correct location. I'm not sure what's different/lacking about the conda environment on the GCP instance...

apex_fresh$ source activate base
(base) apex_fresh$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
Copying apex.egg-info to /home/mcarilli/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6.egg-info
...
(base) apex_fresh$ cd ../..
(base) Desktop$ python
Python 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
>>> from apex import amp
>>>

Michael Carilli · Answer 8 · Wed Apr 24 2019 10:55:50 GMT+0800 (China Standard Time)

Did you ever figure out why conda on GCP was installing to the wrong directory? I'm not a conda expert so if you managed to resolve this issue it will be helpful for future users.

Glenn Jocher · Answer 9 · Wed Apr 24 2019 21:27:24 GMT+0800 (China Standard Time)

No, no luck. I created a blank PyTorch deep learning VM and tried again from scratch, but it's installing to a different Python 3.5 rather than Anaconda. It seems to be an Anaconda issue, and unfortunately I'm not the best conda expert either. I think pip installs into conda are not always problem-free; I've seen other repos with conda-specific install instructions.
creating /home/ultralytics/.local/lib/python3.5/site-packages/apex/amp

In your example above, you see apex in your conda list, right?

Michael Carilli · Answer 10 · Tue Apr 30 2019 00:59:02 GMT+0800 (China Standard Time)

Yes:

(base) apex_fresh$ conda list | grep apex
apex                      0.1                       <pip>

When I've had issues using pip installs in conda environments in the past, I've sometimes resolved them by explicitly running conda install pip within the conda environment before doing pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . within the same environment.

Michael Carilli · Answer 11 · Sat May 11 2019 07:24:52 GMT+0800 (China Standard Time)

Is it possible to use a Docker container on the GCP instance as a potential workaround? There are several options for Docker containers in which we test the Apex install regularly: https://github.com/NVIDIA/apex/tree/master/examples/docker

Even if Docker containers succeed, this does not diminish the importance of having the bare-metal Apex install also work. I'll consult some people who have more experience with conda.

Natalia Gimelshein · Answer 12 · Sat May 11 2019 08:04:05 GMT+0800 (China Standard Time)

My guess is that it's installing to Python 3.5 because it's using the OS's pip3 (Python 3.5) rather than conda's Python 3.7. You can confirm by running pip3 --version and python --version.

Glenn Jocher · Answer 13 · Sat May 11 2019 17:11:12 GMT+0800 (China Standard Time)

@ngimel yes, you are correct! pip itself directs correctly to anaconda3/lib/python3.7, but pip3 is directing to a local python3.5.

glenn@instance-1:~$ pip3 --version
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.5)
glenn@instance-1:~$ python --version
Python 3.7.3
glenn@instance-1:~$ pip --version
pip 19.0.3 from /opt/anaconda3/lib/python3.7/site-packages/pip (python 3.7)

@mcarilli so I understand the situation now:

pip attempts to install to the correct location (conda's Python 3.7), but the install fails.
pip3 installs to an incorrect location (the OS Python 3.5), so the package is inaccessible from the conda env.
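
The mismatch described here comes from pip, pip3, and python each resolving to a different installation. A common way to sidestep it is to invoke pip as a module of the interpreter you actually intend to use (python -m pip). The sketch below assumes pip is installed for that interpreter; the helper names are illustrative and not part of pip or apex.

```python
import sys
import sysconfig


def interpreter_report():
    """Summarize which Python is running and where it installs packages."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        "site_packages": sysconfig.get_paths()["purelib"],
    }


def pip_command(*args):
    """Build a pip command line bound to the running interpreter."""
    return [sys.executable, "-m", "pip"] + list(args)


if __name__ == "__main__":
    print(interpreter_report())
    # Print (rather than run) the command so nothing is installed by accident.
    print(" ".join(pip_command("install", "-v", "--no-cache-dir", ".")))
```

Running this with the conda python should report the Anaconda paths. The printed command can then be run from the apex checkout with the same interpreter.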

Yes, if you could get someone to spin up a PyTorch 1.1 VM in GCP and work through the apex install, that would help tremendously. Docker might be a fallback, but I think it might also be a bridge too far for many users.

Natalia Gimelshein · Answer 14 · Wed May 15 2019 07:09:47 GMT+0800 (China Standard Time)

I can't repro on the latest pytorch vm (Pytorch 1.1 + fastai 1.0 (CUDA 10.0))

(base) root@tensorflow-1-vm:~/apex# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
...
...
    writing list of installed files to '/tmp/pip-record-b_l2sreu/install-record.txt'
done
  Removing source in /tmp/pip-req-build-koc3g9j3
Successfully installed apex-0.1
Cleaning up...

Glenn Jocher · Answer 15 · Sat May 18 2019 18:06:45 GMT+0800 (China Standard Time)

@ngimel I just checked on a new PyTorch 1.1 vm. This time I got a permission denied error:
error: could not create '/opt/anaconda3/lib/python3.7/site-packages/apex': Permission denied

so I tried to use sudo pip install -v --no-cache-dir . which installs without error, but to the incorrect python 2.7. So I still can not install apex to Anaconda 3.7.
Copying apex.egg-info to /usr/local/lib/python2.7/dist-packages/apex-0.1-py2.7.egg-info

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
sudo pip install -v --no-cache-dir .

see-- · Answer 16 · Fri Jun 07 2019 22:49:19 GMT+0800 (China Standard Time)

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
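
A possible reason --user helps is that the interpreter's own site-packages directory is not writable by the login user (compare the Permission denied error above), while the per-user site directory is. The sketch below uses only the standard library and a hypothetical helper name; it prints both locations for whichever Python runs it.

```python
import os
import site
import sysconfig


def install_locations():
    """Show where a plain install and a --user install would land for this Python."""
    system_site = sysconfig.get_paths()["purelib"]
    return {
        "system_site": system_site,
        # False here is the situation in which pip reports "Permission denied".
        "system_writable": os.access(system_site, os.W_OK),
        "user_site": site.getusersitepackages(),
        # If this is False, the interpreter will not look in user_site at all.
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
    }


if __name__ == "__main__":
    for key, value in install_locations().items():
        print(key, "=", value)
```

On Linux the per-user path normally contains the Python version (for example a python3.7 component), so a --user install made with one interpreter is generally not visible to a different Python version.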

Glenn Jocher · Answer 17 · Fri Jun 07 2019 23:13:21 GMT+0800 (China Standard Time)

@see-- this works! I was able to successfully install on a GCP VM with the following commands:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir . --user

UPDATE 1: On running a mixed precision model with the above install I get the following warning: Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.

Installing instead with the following line removed the warning:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
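
To check up front whether the compiled extensions were built, rather than waiting for the warning at run time, something like the following could be used. The module names amp_C and apex_C are assumptions about what the --cuda_ext and --cpp_ext builds produce and are not confirmed in this thread; adjust them if your apex version differs. The helper name is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed names of apex's compiled extensions (an assumption, see above).
    print("missing:", missing_modules(["amp_C", "apex_C"]))
```

Note that find_spec only confirms the module can be located on the import path. It does not load the extension, so a build that is present but broken would not be caught by this check.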

Michael Carilli · Answer 18 · Sat Jun 15 2019 06:24:28 GMT+0800 (China Standard Time)

Excellent, thanks guys. Sorry I haven't had time to do a deep dive myself, but I'm pinning this issue for others.

Sam Bowman · Answer 19 · Sat Jul 27 2019 03:10:03 GMT+0800 (China Standard Time)

For posterity, I was only able to get this to work (after trying many other things) with:

sudo pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

(note the sudo.)

Morgan McGuire · Answer 20 · Thu Oct 31 2019 16:50:12 GMT+0800 (China Standard Time)

I had to use conda-forge to get this working within my conda environment:

conda install -c conda-forge nvidia-apex

Asad · Answer 21 · Wed Jan 08 2020 16:51:12 GMT+0800 (China Standard Time)

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

@see-- @glenn-jocher @sleepinyourhat
If we install with the above-mentioned command, can we import it in both python2.xx and python3.xx? I'm stuck here. My project uses python3.xx and I am unable to install with pip3. Before that, I tried with pip but without --user, and I was able to import it in python2.xx, but I need it with python3.xx.
So if I install with pip and --user, will it install for all Python versions?

My environment
ubuntu 16.04
CUDA Version 10.0.130
CuDNN 7.4.1
torch.version '1.3.1'
Python 3.5.2