host on AWS, use multiple GPUs
andrewljohnson opened this issue · comments
Here's a FOSS repo doing TensorFlow on AWS using Docker. This is a familiar stack.
It would have been cool to use Google Cloud, but they don't seem to want to give any of us access.
Ok, here we go. I have a script that can be run on a fresh g2.2xlarge
instance with Ubuntu 14.04 to bring it up to speed running Python 2.7, TensorFlow 0.8 with GPU support, using CUDA 7.5 and cuDNN v5-rc. Everything is installed using package managers rather than compiled from source by hand, which is especially impressive considering it uses the latest versions of gdal and pyosmium.
# spin up a g2.2xlarge with ubuntu 14.04
# before starting, scp the tarball for cudnn (cudnn-7.5-linux-x64-v5.0-rc.tgz) to /tmp
sudo add-apt-repository ppa:ubuntugis/ubuntugis-testing -y
sudo apt update
export LANGUAGE="en_US.UTF-8"
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
sudo locale-gen en_US.UTF-8
sudo dpkg-reconfigure locales
# blacklist nouveau gpu driver (in favor of CUDA)
echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
# apt prerequisites
sudo apt install -y build-essential git swig default-jdk zip zlib1g-dev libbz2-dev \
  python2.7 python2.7-dev cmake python-pip mercurial libffi-dev libssl-dev \
  libxml2-dev libxslt1-dev libpq-dev libmysqlclient-dev libcurl4-openssl-dev \
  libjpeg-dev libpng12-dev gfortran libblas-dev liblapack-dev libatlas-dev \
  libquadmath0 libfreetype6-dev pkg-config libshp-dev libsqlite3-dev \
  libgd2-xpm-dev libexpat1-dev libgeos-dev libgeos++-dev libsparsehash-dev \
  libv8-dev libicu-dev libgdal1-dev libprotobuf-dev protobuf-compiler \
  devscripts debhelper fakeroot doxygen libboost-dev libboost-all-dev gdal-bin \
  linux-image-extra-virtual linux-source
# cuda
cd /tmp
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt update
sudo apt install -y cuda
sudo apt install -y linux-headers-$(uname -r)
sudo reboot now # <<<<<< reboot!
sudo modprobe nvidia # should return no errors
# cuDNN - assumes you already have the tarball in /tmp
cd /tmp
tar -xzf cudnn-7.5-linux-x64-v5.0-rc.tgz
sudo cp /tmp/cuda/lib64/* /usr/local/cuda/lib64
sudo cp /tmp/cuda/include/* /usr/local/cuda/include
# virtualenv
sudo pip install --upgrade pip
sudo pip install virtualenv
cd ~
virtualenv venv
source venv/bin/activate
# python prerequisites
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
pip install gdal --global-option=build_ext --global-option="-I/usr/include/gdal/"
git clone --branch v2.6.1 https://github.com/osmcode/libosmium.git /tmp/libosmium
pip install --global-option=build_ext --global-option="-I/tmp/libosmium/include" git+https://github.com/osmcode/pyosmium@v2.6.0
At the end of all this, you can do the following and observe Tensorflow using the GPU:
$ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
$ export CUDA_HOME=/usr/local/cuda
$ source venv/bin/activate
$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
>>>
I created an AMI with the above script. You can spin up the AMI and run the following to clone and run DeepOSM:
# global vars that need to be set
export AWS_ACCESS_KEY_ID=***
export AWS_SECRET_ACCESS_KEY=***
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
source ~/venv/bin/activate
# make a /data and /data/cache directory on the SSD for DeepOSM to use
sudo mkdir -p /mnt/data/cache
sudo ln -s /mnt/data /data
sudo chmod -R 777 /mnt/data
export GEO_DATA_DIR=/data
# DeepOSM
git clone https://github.com/trailbehind/DeepOSM.git /tmp/DeepOSM
cd /tmp/DeepOSM
ln -s /tmp/DeepOSM/s3config-default /home/ubuntu/.s3cfg
pip install -r requirements_gpu.txt
export PYTHONPATH=`pwd`
# now you can run DeepOSM scripts!
python bin/create_training_data.py
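Before kicking off those scripts, it can save a debugging round-trip to confirm the environment variables from the block above are actually set. Here's a minimal preflight check — a sketch, not part of DeepOSM; `missing_vars` and the `REQUIRED` list are hypothetical names that just mirror the exports above:

```python
import os

# Variables the setup steps above export; this checker is illustrative
# and not part of DeepOSM itself. LD_LIBRARY_PATH is left out because
# it may legitimately vary by machine.
REQUIRED = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "CUDA_HOME",
    "GEO_DATA_DIR",
]

def missing_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_vars()` at the top of a session (and bailing out if it returns anything) is cheaper than discovering a missing key halfway through `create_training_data.py`.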
Now, a couple of questions for you all:
- do you think I should add Docker on top of all this? I think we get most of the benefits of Docker (reusable, disposable) from AMIs, without having to debug another layer when something goes wrong.
- what's the next step here, so you can start using it? Want me to share the AMI?
If we don't use Docker, won't that mean we have to maintain multiple
builds - one for AWS, and one or more for local dev on Linux/Mac?
Is the next step to be able to log in and run this?
python bin/create_training_data.py
python bin/run_analysis.py
Then I could compare the performance and experience to my Linux box, and
start getting a feel for how our AWS lab might work for a user.
@silberman likes Jupyter notebooks a lot too - I think he sees us providing
a hosted Jupyter notebook to tinker with, for us or others?
- I dunno. If we use Docker, we'll need to maintain a build to install Docker on Ubuntu, and we'll have to agree on some way to deploy containers. So I think we'll need a separate build for Ubuntu no matter what -- either to install DeepOSM or to install Docker. Correct me if you disagree here.
- Yup, I'm running those two commands right now and debugging any errors that arise. (edit: done! both those commands finished successfully)
It seems like a good production solution could be:
- Overpass instance - #39
- RDS instance with NAIP data, plus separate Docker app to fill/edit the DB? - #23
- Tensorflow analyzer - split as own app, per #30
- an app for deeposm.org - web app for OSM review/edit - which I'm working on
Apps 1 & 2 provide an API to app 3, which publishes data to S3 for app 4 to ingest into its own Django Postgres?
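To make the proposed split concrete, here's a rough sketch of that four-app data flow. Every class and method name here is a hypothetical placeholder — none of this exists in DeepOSM; it only illustrates which app talks to which:

```python
# Hypothetical sketch of the proposed four-app pipeline; all names are
# placeholders, chosen only to show the direction of the data flow.

class OverpassAPI:            # app 1: OSM vector data (issue #39)
    def query(self, bbox):
        return {"ways": [], "bbox": bbox}

class NAIPStore:              # app 2: RDS-backed NAIP imagery (issue #23)
    def tiles(self, bbox):
        return [("naip-tile-0", bbox)]

class Analyzer:               # app 3: the TensorFlow analyzer (issue #30)
    def __init__(self, osm, naip):
        self.osm, self.naip = osm, naip

    def run(self, bbox):
        labels = self.osm.query(bbox)     # apps 1 & 2 both feed app 3
        imagery = self.naip.tiles(bbox)
        findings = [{"tile": t, "osm_ways": len(labels["ways"]), "score": 0.5}
                    for t, _ in imagery]
        return self.publish_to_s3(findings)

    def publish_to_s3(self, findings):
        # stand-in for an S3 upload; app 4 would ingest this payload
        return {"s3_key": "analysis/delaware.json", "findings": findings}

class ReviewWebApp:           # app 4: deeposm.org review/edit UI
    def ingest(self, published):
        # stand-in for loading findings into its own Django Postgres
        return len(published["findings"])

published = Analyzer(OverpassAPI(), NAIPStore()).run(bbox=(39.0, -75.8, 39.8, -75.0))
print(ReviewWebApp().ingest(published))  # number of findings loaded for review
```

The point of the sketch is the coupling: apps 1 and 2 only ever answer queries, app 3 is the only writer to S3, and app 4 only reads what app 3 published.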
My guess is this production solution will become more of a requirement at scale: covering more than one state, or providing more flexible analysis, will be much more convenient if we set up something like this. We can go ahead and do deeposm.org/delaware, but then we may have to get this done.
- I need to get my head around this code and get my work moved to AWS
- merging with other infrastructure issues