Recent years have seen tremendous academic effort go into the design and implementation of efficient neural networks to cope with the ever-increasing amount of data on ever-smaller and more efficient devices. Yet, as of the time of writing, most deep learning practitioners are unaware of even the most basic GPU acceleration techniques.
Especially in academia, many do not even use Automatic Mixed Precision (AMP), which can reduce memory requirements to as little as 1/4 and increase speeds by as much as 4x to 5x. This is the case even though AMP can be enabled with little effort using the PyTorch Lightning or HuggingFace Accelerate libraries.
Even the novice who has only just dipped their toes into the field of deep learning knows that more compute is a key ingredient for success. No matter how brilliant the scientist, outperforming a rival with 10x more compute is no mean feat.
This template was created with the aim of enabling researchers and engineers without much knowledge of GPUs, CUDA, Docker, etc. to squeeze every last drop of performance from their GPUs using the same hardware and neural networks.
Although Docker images with PyTorch source builds are already available in the official PyTorch Docker Hub repository and NVIDIA NGC repository, these images have a multitude of other packages installed with them, making it difficult to integrate them into pre-existing projects. Moreover, many practitioners prefer using local environments over Docker images.
The project presented here is different. It has no additional libraries to work with except for those installed by the user. Even better, the wheels generated by the build can be extracted for use in any environment, with no need to learn how to use Docker, though it also provides a Docker Compose file to make using Docker much easier as well.
If you have long yearned for a quicker end to the hours spent staring at TensorBoard as your models inch past the epochs, this project may be just the thing. A source build of PyTorch on the latest version of CUDA, combined with AMP, may achieve training and inference times up to 10x faster than a naïve PyTorch environment.
I sincerely hope that my project will be of service to practitioners in both academia and industry. Users who find my work beneficial are more than welcome to show their appreciation by starring this repository.
Before using this template, first check whether you are actually using your GPU!
In most scenarios, slow training is caused by an inefficient Extract, Transform, Load (ETL) pipeline.
Training is slow because the data is not getting to the GPU fast enough, not because the GPU is running slowly.
Run `watch nvidia-smi` to check whether GPU utilization is high enough to justify compute optimizations.
If GPU utilization is low or peaks sporadically, design an efficient ETL pipeline before using this template.
Otherwise, faster compute will not help very much as it will not be the bottleneck.
See https://www.tensorflow.org/guide/data_performance for a guide on designing an efficient ETL pipeline.
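As a rough aid for the utilization check described above, the following sketch samples GPU utilization through the CSV query interface of `nvidia-smi` and flags GPUs that average below a threshold. The function names, the 80% threshold, and the sampling defaults are illustrative assumptions and are not part of this template.

```python
import subprocess
import time
from statistics import mean
from typing import List

# nvidia-smi query that prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(csv_text: str) -> List[int]:
    """Parse the query output into one integer percentage per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_utilization(samples: int = 10, interval: float = 1.0) -> List[float]:
    """Return the mean utilization of each GPU over `samples` readings."""
    readings = []
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        readings.append(parse_utilization(result.stdout))
        time.sleep(interval)
    # zip(*readings) regroups the readings per GPU before averaging.
    return [mean(per_gpu) for per_gpu in zip(*readings)]


def looks_input_bound(mean_utilization: List[float], threshold: float = 80.0) -> bool:
    """Assumed heuristic: any GPU averaging below `threshold` percent is starved."""
    return any(value < threshold for value in mean_utilization)
```

Calling `looks_input_bound(sample_utilization())` while training is running gives a quick yes/no hint on whether the data pipeline deserves attention first. The threshold is only a starting point and should be adjusted to the workload.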
A Template repository to build PyTorch from source on any version of PyTorch/CUDA/cuDNN.
PyTorch built from source is much faster (as much as 4x on some benchmarks) than PyTorch installed from `pip`/`conda`, but building from source is a difficult and bug-prone process.
This repository is a highly modular template to build any version of PyTorch from source on any version of CUDA. It provides an easy-to-use Dockerfile that can be integrated into any Linux-based image or project.
For researchers unfamiliar with Docker, the generated wheel files can be extracted to install PyTorch on their local environments.
Windows users may also use this project via WSL. See instructions below.
A `Makefile` is provided both as an interface for easy use and as a tutorial for building custom images.
A `docker-compose.yaml` file is also provided for a simple interactive development experience.
The speed gains from this template come from the following factors:
- Using the latest version of CUDA and associated libraries (cuDNN, cuBLAS, etc.).
- Using a source build made especially for the target machine with the latest software customizations instead of a build that must be compatible with different hardware and software environments.
- Using the latest version of PyTorch and subsidiary libraries. Many users do not update their PyTorch version because of compatibility issues with their pre-existing environment.
- Informing users on where to look for solutions to their speed problems (this may be the most important factor).
Combined with techniques such as AMP and cuDNN benchmarking, computational throughput can be increased dramatically (e.g., x10) on the same hardware.
Even if you do not wish to use Docker in your project, you may still find this template useful.
The wheel files generated by the build can be used in any Python environment with no dependency on Docker.
This project can thus be used to generate custom wheel files, improving both training and inference speeds dramatically, for any desired environment (`conda`, `pip`, etc.).
Users are free to customize the `train` stage of the `Dockerfile` as they please.
However, do not change the `build` stages unless absolutely necessary.
This project is a template, and users are expected to customize it to fit their needs.
The code is assumed to be running on a Linux host with the necessary NVIDIA Drivers and a recent version of Docker & Docker Compose pre-installed. If this is not the case, install these first.
To build a training image, first edit the `train` stage of the `Dockerfile` to include desired packages from `apt`/`conda`/`pip`.
Then, visit https://developer.nvidia.com/cuda-gpus to find the Compute Capability (CC) of the target GPU device.
Finally, run `make all CC=TARGET_CC(s)`.
(1) `make all CC="8.6"` for RTX 3090,
(2) `make all CC="7.5;8.6"` (no whitespace between CCs) for both RTX 2080Ti and RTX 3090 (building for many GPU CCs will increase build time).
This will result in an image, `pytorch_source:train`, which can be used for training.
Note that CCs for devices not available during the build can be used to build the image.
For example, if the image must be used on an RTX 2080Ti machine but the user only has an RTX 3090, the user can set `CC="7.5"` to enable the image to operate on the RTX 2080Ti GPU.
See https://pytorch.org/docs/stable/cpp_extension.html for an in-depth guide on how to set `TORCH_CUDA_ARCH_LIST`, which is specified by `CC` in the `Makefile`.
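Because the `CC` value must be a semicolon-separated list with no whitespace, a small helper can normalize a list of Compute Capabilities before it is passed to `make`. This is a minimal sketch: the function name is hypothetical, and it only handles plain `major.minor` values, not other forms that `TORCH_CUDA_ARCH_LIST` may accept.

```python
import re
from typing import Iterable

# Plain "major.minor" Compute Capability values such as "7.5" or "8.6".
_CC_PATTERN = re.compile(r"^\d+\.\d+$")


def format_cc(capabilities: Iterable[str]) -> str:
    """Return a deduplicated, sorted, semicolon-separated CC string."""
    cleaned = {cc.strip() for cc in capabilities}
    for cc in cleaned:
        if not _CC_PATTERN.match(cc):
            raise ValueError(f"Not a plain Compute Capability value: {cc!r}")
    # Sort numerically so that "10.0" would come after "8.6".
    ordered = sorted(cleaned, key=lambda cc: tuple(map(int, cc.split("."))))
    return ";".join(ordered)
```

For instance, `format_cc(["8.6", " 7.5", "8.6"])` yields the string used in example (2) above. On a machine that already has PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair of the current GPU, which can be used instead of looking the value up manually.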
The `Makefile` is designed to make using this package simple and modular.
The first image to be created is `pytorch_source:build_install`, which contains all packages necessary for the build.
The installation image is created separately to prevent downloads from causing cache misses.
The second image is `pytorch_source:build_torch-v1.9.1` (by default), which contains the wheels for PyTorch, TorchVision, TorchText, and TorchAudio with settings for PyTorch 1.9.1 on Ubuntu 20.04 LTS with Python 3.8, CUDA 11.3.1 and cuDNN 8.
The second image exists to cache the results of the build process.
If you do not wish to use Docker and would like to only extract the `.whl` wheel files for a pip install on your environment, the generated wheel files can be found in the `/tmp/dist` directory.
Saving the build results also allows for more convenient version switching in case different PyTorch versions (different CUDA version, different library version, etc.) are needed.
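One possible way to copy the wheels out of the build image is sketched below. It assumes that `/tmp/dist` lives inside the second (build) image and uses the standard `docker create`, `docker cp`, and `docker rm` commands. The function names, the temporary container name, and the `./dist` destination are hypothetical.

```python
import subprocess
from typing import List


def wheel_extraction_commands(
    image: str, dest: str = "./dist", container: str = "wheel_extract"
) -> List[List[str]]:
    """Build the Docker commands that copy /tmp/dist out of `image`."""
    return [
        # Create (but do not start) a container so its filesystem can be read.
        ["docker", "create", "--name", container, image],
        # Copy the wheel directory from the container to the host.
        ["docker", "cp", f"{container}:/tmp/dist", dest],
        # Remove the temporary container.
        ["docker", "rm", container],
    ]


def extract_wheels(image: str, dest: str = "./dist") -> None:
    """Run the extraction commands in order, stopping on the first failure."""
    for command in wheel_extraction_commands(image, dest):
        subprocess.run(command, check=True)
```

With the default image name, `extract_wheels("pytorch_source:build_torch-v1.9.1")` should leave the `.whl` files under `./dist`, from where they can be installed with `pip install` in a local environment. If the image defines no default command, `docker create` may need an explicit one appended.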
The final image is `pytorch_source:train`, which is the image to be used for actual training.
It relies on the previous stages only for the build artifacts (wheels, etc.) and nothing else.
This makes it very simple to create different training images optimized for different environments and GPU devices.
Because PyTorch has already been built, the training image only needs to download the remaining `apt`/`conda`/`pip` packages.
Moreover, caching is implemented to speed up even this process.
International users may find this section helpful.
The `train` image has its timezone set by the `TZ` variable using the `tzdata` package.
The default timezone is `Asia/Seoul`, but this can be changed by specifying the `TZ` variable when calling `make`.
Use IANA timezone names to specify the desired timezone.
Example: `make all CC="8.6" TZ=America/Los_Angeles` to use LA time on the training image.
NOTE: Only the training image has timezone settings. The installation and build images do not use timezone information.
In addition, the training image has `apt` and `pip` installation URLs updated for Korean users.
If you wish to speed up your installs, please find URLs optimized for your location,
though the installation caches may make this unnecessary.
PyTorch subsidiary libraries only work with matching versions of PyTorch.
To change the version of PyTorch, set the `PYTORCH_VERSION_TAG`, `TORCHVISION_VERSION_TAG`, `TORCHTEXT_VERSION_TAG`, and `TORCHAUDIO_VERSION_TAG` variables to matching versions.
The `*_TAG` variables must be GitHub tags or branch names of those repositories.
Visit the GitHub repositories of each library to find the appropriate tags.
Example: To build on an RTX 3090 GPU with PyTorch 1.9.1, use the following command:
`make all CC="8.6" PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1`.
The resulting image, `pytorch_source:train`, can be used for training with PyTorch 1.9.1 on GPUs with Compute Capability 8.6.
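Since the four version tags must always be changed together, it can help to keep matching sets in one place and generate the `make` invocation from them. The sketch below is only an illustration: the table contains just the PyTorch 1.9.1 combination given in this README, and the function name and `extra` parameter are hypothetical.

```python
import shlex
from typing import Dict, Optional

# Matching library tags per PyTorch tag. Only the v1.9.1 set comes from this
# README; other entries would have to be looked up in the GitHub repositories.
MATCHING_TAGS: Dict[str, Dict[str, str]] = {
    "v1.9.1": {
        "TORCHVISION_VERSION_TAG": "v0.10.1",
        "TORCHTEXT_VERSION_TAG": "v0.10.1",
        "TORCHAUDIO_VERSION_TAG": "v0.9.1",
    },
}


def make_command(
    target: str, cc: str, pytorch_tag: str, extra: Optional[Dict[str, str]] = None
) -> str:
    """Return a shell-safe `make` command line with a consistent tag set."""
    if pytorch_tag not in MATCHING_TAGS:
        raise ValueError(f"No known matching tags for {pytorch_tag!r}")
    args = ["make", target, f"CC={cc}", f"PYTORCH_VERSION_TAG={pytorch_tag}"]
    args += [f"{key}={value}" for key, value in MATCHING_TAGS[pytorch_tag].items()]
    args += [f"{key}={value}" for key, value in (extra or {}).items()]
    # shlex.join quotes arguments such as CC=7.5;8.6 so the shell keeps the ";".
    return shlex.join(args)
```

Calling `make_command("all", "8.6", "v1.9.1")` reproduces the example command above without the quotation marks around the CC value. Extra variables such as `TRAIN_NAME` can be passed through `extra`.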
To use multiple training images on the same host, give a different name to `TRAIN_NAME`, which has a default value of `train`.
New training images can be created without having to rebuild PyTorch if the same build image is used for different training images. Creating new training images takes only a few minutes at most.
This is useful for the following use cases.
- Allowing different users, who have different UID/GIDs, to use separate training images.
- Using different versions of the final training image with different library installations and configurations.
For example, if `pytorch_source:build_torch-v1.9.1` has already been built, Alice and Bob would use the following commands to create separate images.
Alice:
make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_alice
Bob:
make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_bob
This way, Alice's image would have her UID/GID while Bob's image would have his UID/GID. This procedure is necessary because training images have their users set during the build. Also, different users may install different libraries in their training images. Their environment variables and other settings may also be different.
When using build images such as `pytorch_source:build_torch-v1.9.1` as a build cache for creating new training images, the user must re-specify all build arguments (variables specified by `ARG` and `ENV` using `--build-arg`) of all previous layers.
Otherwise, the default values for these arguments will be given to the Dockerfile and a cache miss will occur because of the different input values.
This will both waste time rebuilding previous layers and, more importantly, cause inconsistency in the training images due to environment mismatch.
This requirement also applies to the `docker-compose.yaml` file.
All arguments given to the `Dockerfile` during the build must be respecified.
This includes default values present in the `Makefile` but not present in the `Dockerfile`, such as the version tags.
If Docker starts to rebuild code that you have already built, suspect that build arguments have been given incorrectly.
See https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache for more information.
The `BUILDKIT_INLINE_CACHE` build argument must also be set when building an image in order to use that image as a cache later. See https://docs.docker.com/engine/reference/commandline/build/#specifying-external-cache-sources for more information.
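When an unexpected rebuild happens, comparing the build arguments of the original build with those of the current invocation can help locate the mismatch. The helper below is a hypothetical sketch of such a comparison. It treats an omitted argument as a change, because an omitted argument falls back to the default in the `Dockerfile`, which may differ from the value used before.

```python
from typing import Dict, List


def changed_build_args(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the names of build arguments that differ or are missing."""
    names = sorted(set(previous) | set(current))
    # dict.get returns None for a missing name, so omissions count as changes.
    return [name for name in names if previous.get(name) != current.get(name)]
```

An empty list means both invocations pass the same arguments. Any name in the result is a candidate cause for the cache miss.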
The `Makefile` provides the `*-full` commands for advanced usage.
`make all-full CC=YOUR_GPU_CC TRAIN_NAME=train_cu102` will create `pytorch_source:build_install-ubuntu18.04-cuda10.2-cudnn8-py3.9`, `pytorch_source:build_torch-v1.9.1-ubuntu18.04-cuda10.2-cudnn8-py3.9`, and `pytorch_source:train_cu102` by default.
These images can be used for training/deployment on CUDA 10 devices such as the GTX 1080Ti.
Also, the `*-clean` commands are provided to check for cache reliance on previous builds.
Set `CUDA_VERSION`, `CUDNN_VERSION`, and `MAGMA_VERSION` to change CUDA versions.
`PYTHON_VERSION` may also be changed if necessary.
This will create a build image that can be used as a cache to create training images with the `build-train` command.
Also, the extensive use of caching in the project means that the second build is much faster than the first build. This may be advantageous if many images must be created for multiple PyTorch/CUDA versions.
CentOS and UBI images can be created with only minor edits to the `Dockerfile`.
Read the `Dockerfile` for full instructions.
Set the `LINUX_DISTRO` and `DISTRO_VERSION` arguments afterwards.
Windows users may use this template by updating to Windows 11 and installing Windows Subsystem for Linux (WSL). WSL on Windows 11 gives a similar experience to using native Linux.
This project has been tested on WSL on Windows 11 with the WSL CUDA driver and Docker Desktop for Windows.
Docker containers are designed to be transient and best practice dictates that
developers should create a new container for each run or command.
In practice, this is very inconvenient for development, especially for deep learning applications,
where libraries must be constantly installed and bugs are only evident at runtime.
However, developing in local environments with `conda` or in individual containers risks making the development environment unreproducible.
To alleviate this problem, a `docker-compose.yaml` file is provided for easy management of containers.
Using Docker Compose V2 (see https://docs.docker.com/compose/cli-command), run the following two commands, where `train` is the default service name in the provided `docker-compose.yaml`.
docker compose up -d train
docker compose exec train /bin/bash
This will open an interactive shell with settings specified by `docker-compose.yaml` and `.env`, which is extremely convenient for managing reproducible development environments.
Variables can be saved in the `.env` file, which should be placed in the project root.
Variables such as the version tags and UID/GID values must be saved in `.env` to use Docker Compose without cache misses or errors.
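A `.env` file is a plain list of `KEY=VALUE` lines, so it can be generated instead of written by hand. The sketch below shows one way to do this. The variable names `UID` and `GID` are assumptions made for illustration, so check `docker-compose.yaml` for the names it actually expects. The version tags are the defaults mentioned in this README.

```python
import os
from typing import Dict


def render_env(variables: Dict[str, str]) -> str:
    """Render variables as KEY=VALUE lines, one per line."""
    return "".join(f"{key}={value}\n" for key, value in variables.items())


def default_variables() -> Dict[str, str]:
    """Collect values for the current user and the default version tags."""
    return {
        # Hypothetical names: the compose file may use different ones.
        "UID": str(os.getuid()),
        "GID": str(os.getgid()),
        "PYTORCH_VERSION_TAG": "v1.9.1",
        "TORCHVISION_VERSION_TAG": "v0.10.1",
        "TORCHTEXT_VERSION_TAG": "v0.10.1",
        "TORCHAUDIO_VERSION_TAG": "v0.9.1",
    }


def write_env(path: str = ".env") -> None:
    """Write the rendered variables to `path` in the project root."""
    with open(path, "w", encoding="utf-8") as env_file:
        env_file.write(render_env(default_variables()))
```

Running `write_env()` from the project root creates the file. This sketch does not quote values, so values containing whitespace would need extra handling.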
For example, if a new `pip` or `apt` package must be installed for the project, users can edit the `Dockerfile`'s `train` layer to install the necessary package and run the following command: `docker compose up -d --build train`.
This will rebuild the image and start a new container, but will not rebuild PyTorch if caches are set appropriately. Users thus need only wait for the additional downloads, which are also accelerated by caching and with fast mirror URLs.
To remove the containers, use `docker compose down`.
Users with remote servers may use Docker contexts (see https://docs.docker.com/engine/context/working-with-contexts) to access their containers from their local environments.
For more information on Docker Compose, read the documentation https://github.com/compose-spec/compose-spec/blob/master/spec.md.
- Entering a container by `ssh` will remove all variables set by `ENV`. This is because `sshd` starts a new environment, wiping out all previous variables. Using `docker`/`docker-compose` to enter containers is strongly recommended.
- Building on CUDA 11.4.x is not available as of October 2021 because `magma-cuda114` has not been released on the `pytorch` anaconda channel yet. Users may attempt building with older versions of `magma-cuda` or try the version available on `conda-forge`. A source build of `magma` would be welcomed as a pull request.
- Ubuntu 16.04 build fails. This is because the default `git` installed by `apt` on Ubuntu 16.04 does not support the `--jobs` flag. Add the `git-core` PPA to `apt` and install the latest version of git. Also, PyTorch v1.9+ will not build on Ubuntu 16. Lower the version tag to v1.8.2 to build. However, as Ubuntu 16.04 has already reached EOL, the project will be left as is.
- MORE STARS. If you are reading this, star this repository immediately. I'm serious.
- CentOS and UBI images have not been implemented yet. As they require only simple modifications, pull requests implementing them would be very much welcome.
- Translations into other languages are welcome. Please create a separate `LANG.README.md` file and create a PR.