host on AWS, use multiple GPUs
andrewljohnson opened this issue · comments
Here's a FOSS repo doing TensorFlow on AWS using Docker. This is a familiar stack.
It would have been cool to use Google Cloud, but they don't seem to want to give any of us access.
Ok, here we go. I have a script that can be run on a fresh g2.2xlarge
instance with Ubuntu 14.04 to bring it up to speed running Python 2.7, TensorFlow 0.8 with GPU support, using CUDA 7.5 and cuDNN v5-rc. Everything is installed using package managers rather than compiled from source by hand, which is especially impressive considering it uses the latest versions of gdal and pyosmium.
# spin up a g2.2xlarge with ubuntu 14.04
# before starting, scp the tarball for cudnn (cudnn-7.5-linux-x64-v5.0-rc.tgz) to /tmp
sudo add-apt-repository ppa:ubuntugis/ubuntugis-testing -y
sudo apt update
export LANGUAGE="en_US.UTF-8"
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
sudo locale-gen en_US.UTF-8
sudo dpkg-reconfigure locales
# blacklist nouveau gpu driver (in favor of CUDA)
echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
# apt prerequisites
sudo apt install -y build-essential git swig default-jdk zip zlib1g-dev libbz2-dev \
  python2.7 python2.7-dev cmake python-pip mercurial libffi-dev libssl-dev \
  libxml2-dev libxslt1-dev libpq-dev libmysqlclient-dev libcurl4-openssl-dev \
  libjpeg-dev libpng12-dev gfortran libblas-dev liblapack-dev libatlas-dev \
  libquadmath0 libfreetype6-dev pkg-config libshp-dev libsqlite3-dev \
  libgd2-xpm-dev libexpat1-dev libgeos-dev libgeos++-dev libsparsehash-dev \
  libv8-dev libicu-dev libgdal1-dev libprotobuf-dev protobuf-compiler \
  devscripts debhelper fakeroot doxygen libboost-dev libboost-all-dev gdal-bin \
  linux-image-extra-virtual linux-source
# cuda
cd /tmp
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt update
sudo apt install -y cuda
sudo apt install -y linux-headers-$(uname -r)
sudo reboot now # <<<<<< reboot!
sudo modprobe nvidia # should return no errors
# cuDNN - assumes you already have the tarball in /tmp
cd /tmp
tar -xzf cudnn-7.5-linux-x64-v5.0-rc.tgz
sudo cp /tmp/cuda/lib64/* /usr/local/cuda/lib64
sudo cp /tmp/cuda/include/* /usr/local/cuda/include
# virtualenv
sudo pip install --upgrade pip
sudo pip install virtualenv
cd ~
virtualenv venv
source venv/bin/activate
# python prerequisites
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
pip install gdal --global-option=build_ext --global-option="-I/usr/include/gdal/"
git clone --branch v2.6.1 https://github.com/osmcode/libosmium.git /tmp/libosmium
pip install --global-option=build_ext --global-option="-I/tmp/libosmium/include" git+https://github.com/osmcode/pyosmium@v2.6.0
At the end of all this, you can do the following and observe Tensorflow using the GPU:
$ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
$ export CUDA_HOME=/usr/local/cuda
$ source venv/bin/activate
$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
>>>
I created an AMI with the above script. You can spin up the AMI and run the following to clone and run DeepOSM:
# global vars that need to be set
export AWS_ACCESS_KEY_ID=***
export AWS_SECRET_ACCESS_KEY=***
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
source ~/venv/bin/activate
# make a /data and /data/cache directory on the SSD for DeepOSM to use
sudo mkdir -p /mnt/data/cache
sudo ln -s /mnt/data /data
sudo chmod -R 777 /mnt/data
export GEO_DATA_DIR=/data
# DeepOSM
git clone https://github.com/trailbehind/DeepOSM.git /tmp/DeepOSM
cd /tmp/DeepOSM
ln -s /tmp/DeepOSM/s3config-default /home/ubuntu/.s3cfg
pip install -r requirements_gpu.txt
export PYTHONPATH=`pwd`
# now you can run DeepOSM scripts!
python bin/create_training_data.py
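Before kicking off those scripts, it can save a debugging round-trip to confirm the environment variables from the block above are actually set. Here's a minimal preflight check — a sketch, not part of DeepOSM; `missing_vars` and the `REQUIRED` list are hypothetical names that just mirror the exports above:

```python
import os

# Variables the setup steps above export; this checker is illustrative
# and not part of DeepOSM itself. LD_LIBRARY_PATH is left out because
# it may legitimately vary by machine.
REQUIRED = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "CUDA_HOME",
    "GEO_DATA_DIR",
]

def missing_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_vars()` at the top of a session (and bailing out if it returns anything) is cheaper than discovering a missing key halfway through `create_training_data.py`.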
Now, a couple of questions for you all:
- do you think I should add Docker on top of all this? I think we get most of the benefits of Docker (reusable, disposable) from AMIs, without having to debug another layer when something goes wrong.
- what's the next step here, so you can start using it? Want me to share the AMI?
If we don't use Docker, won't that mean we have to maintain multiple
builds - one for AWS, and one or more for local dev on Linux/Mac?
Is the next step to be able to log in and run this?
python bin/create_training_data.py
python bin/run_analysis.py
Then I could compare the performance and experience to my Linux box, and
start getting a feel for how our AWS lab might work for a user.
@silberman likes Jupyter notebooks a lot too - I think he sees us providing
a hosted Jupyter notebook to tinker with, for us or others?
- I dunno. If we use Docker, we'll need to maintain a build to install Docker on Ubuntu, and we'll have to agree on some way to deploy containers. So I think we'll need a separate build for Ubuntu no matter what -- either to install DeepOSM or to install Docker. Correct me if you disagree here.
- Yup, I'm running those two commands right now and debugging any errors that arise. (edit: done! both those commands finished successfully)
It seems like a good production solution could be:
- Overpass instance - #39
- RDS instance with NAIP data, plus separate Docker app to fill/edit the DB? - #23
- Tensorflow analyzer - split as own app, per #30
- an app for deeposm.org - web app for OSM review/edit - which I'm working on
Apps 1 & 2 provide an API to app 3, which publishes data to S3 for app 4 to ingest into its own Django Postgres?
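To make the proposed split concrete, here's a rough sketch of that four-app data flow. Every class and method name here is a hypothetical placeholder — none of this exists in DeepOSM; it only illustrates which app talks to which:

```python
# Hypothetical sketch of the proposed four-app pipeline; all names are
# placeholders, chosen only to show the direction of the data flow.

class OverpassAPI:            # app 1: OSM vector data (issue #39)
    def query(self, bbox):
        return {"ways": [], "bbox": bbox}

class NAIPStore:              # app 2: RDS-backed NAIP imagery (issue #23)
    def tiles(self, bbox):
        return [("naip-tile-0", bbox)]

class Analyzer:               # app 3: the TensorFlow analyzer (issue #30)
    def __init__(self, osm, naip):
        self.osm, self.naip = osm, naip

    def run(self, bbox):
        labels = self.osm.query(bbox)     # apps 1 & 2 both feed app 3
        imagery = self.naip.tiles(bbox)
        findings = [{"tile": t, "osm_ways": len(labels["ways"]), "score": 0.5}
                    for t, _ in imagery]
        return self.publish_to_s3(findings)

    def publish_to_s3(self, findings):
        # stand-in for an S3 upload; app 4 would ingest this payload
        return {"s3_key": "analysis/delaware.json", "findings": findings}

class ReviewWebApp:           # app 4: deeposm.org review/edit UI
    def ingest(self, published):
        # stand-in for loading findings into its own Django Postgres
        return len(published["findings"])

published = Analyzer(OverpassAPI(), NAIPStore()).run(bbox=(39.0, -75.8, 39.8, -75.0))
print(ReviewWebApp().ingest(published))  # number of findings loaded for review
```

The point of the sketch is the coupling: apps 1 and 2 only ever answer queries, app 3 is the only writer to S3, and app 4 only reads what app 3 published.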
My guess is this production solution will become more of a requirement at scale: covering more than one state, or providing more flexible analysis, will be much more convenient if we set up something like this. We can go ahead and do deeposm.org/delaware, but then we may have to get this done.
- I need to get my head around this code and get my work moved to AWS
- merging with other infrastructure issues