The Ultimate PyTorch Source-Build Template

TL;DR

PyTorch built from source can be up to x4 faster than a naïve PyTorch install on some benchmarks. This repository provides a template for building PyTorch pip wheel binaries from source for any PyTorch version on any CUDA version in any environment. The wheels can be used in any project environment, including local conda environments, on any CUDA GPU.

A new paradigm for deep learning development using Docker Compose as an MLOps tool is also proposed. Hopefully, this method will become best practice in both academia and industry.

Preamble

Recent years have seen tremendous academic effort go into the design and implementation of efficient neural networks to cope with the ever-increasing amount of data on ever-smaller and more efficient devices. Yet, as of the time of writing, most deep learning practitioners are unaware of even the most basic GPU acceleration techniques.

Especially in academia, many do not even use Automatic Mixed Precision (AMP), which in favorable cases can reduce memory requirements to as little as 1/4 and increase speeds by x4~5. This is the case even though AMP can be enabled without much hassle using the HuggingFace Accelerate or PyTorch Lightning libraries. The Accelerate library in particular can be integrated into any pre-existing PyTorch project with only a few lines of code.

Even the novice who has only just dipped their toes into the mysteries of deep learning knows that more compute is a key ingredient for success. No matter how brilliant the scientist, outperforming a rival with x10 more compute is no mean feat.

This template was created with the aim of enabling researchers and engineers without much knowledge of GPUs, CUDA, Docker, etc. to squeeze every last drop of performance from their GPUs using the same hardware and neural networks.

Although Docker images with PyTorch source builds are already available in the official PyTorch Docker Hub repository and the NVIDIA NGC repository, these images have a multitude of other packages installed with them, making it difficult to integrate them into pre-existing projects. Moreover, many practitioners prefer using local environments over Docker images.

The project presented here is different. It has no additional libraries to work with except for those installed by the user. Even better, the wheels generated by the build can be extracted for use in any environment, with no need to learn how to use Docker, though the second part of this project provides a docker-compose.yaml file to make using Docker much easier.

If you are among those who could only yearn for a quicker end to the long hours endured staring at Tensorboard as your models inched past the epochs, this project may be just the thing. When using a source build of PyTorch with the latest version of CUDA, combined with AMP, one may achieve training/inference times x10 faster than in a naïve PyTorch environment.

I sincerely hope that my project will be of service to practitioners in both academia and industry. Users who find my work beneficial are more than welcome to show their appreciation by starring this repository.

Warning

Before using this template, first check whether you are actually using your GPU!

In most scenarios, slow training is caused by an inefficient Extract, Transform, Load (ETL) pipeline. Training is slow because the data is not getting to the GPU fast enough, not because the GPU is running slowly. Run watch nvidia-smi to check whether GPU utilization is high enough to justify compute optimizations. If GPU utilization is low or peaks sporadically, design an efficient ETL pipeline before using this template. Otherwise, faster compute will not help very much as it will not be the bottleneck.
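
As a rough illustration of the check described above, the following sketch summarizes GPU utilization samples. It assumes the samples come from nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits, which prints one integer percentage per GPU per line. The function names and the 80% threshold are hypothetical choices for illustration and are not part of this template.

```python
import statistics
import subprocess
from typing import List


def parse_utilization(raw: str) -> List[int]:
    """Parse one integer percentage per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in raw.splitlines() if line.strip()]


def is_compute_bound(samples: List[int], threshold: float = 80.0) -> bool:
    """Return True if mean utilization reaches the threshold (an assumed cutoff)."""
    if not samples:
        raise ValueError("no utilization samples given")
    return statistics.mean(samples) >= threshold


def query_gpu_utilization() -> List[int]:
    """Query the current utilization of every visible GPU (needs NVIDIA drivers)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)
```

Collecting several samples with query_gpu_utilization() while training is running and passing them to is_compute_bound() gives a first hint. A low or spiky result suggests that the data pipeline, not the GPU, deserves attention first.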

See https://www.tensorflow.org/guide/data_performance for a guide on designing an efficient ETL pipeline.

The NVIDIA DALI library may also be helpful. The DALI PyTorch plugin provides an API for efficient ETL pipelines in PyTorch.

Introduction

A template repository to build PyTorch from source on any version of PyTorch/CUDA/cuDNN.

To use this template, press the green Use this template button at the top of the repository page. This is more convenient than forking this repository.

PyTorch built from source is much faster (as much as x4 on some benchmarks, though x2 is more typical) than PyTorch installed from pip/conda, but building from source is an arduous and bug-prone process.

This repository is a highly modular template to build any version of PyTorch from source on any version of CUDA. It provides an easy-to-use Dockerfile that can be integrated into any Linux-based image or project.

For researchers unfamiliar with Docker, the generated wheel files can be extracted to install PyTorch on their local environments. Windows users may also use this project via WSL. See instructions below.

A Makefile is provided both as an interface for easy use and as a tutorial for building custom images. A docker-compose.yaml file is also provided for a simple interactive development experience using Docker.

The speed gains from this template come from the following factors:

Using the latest version of CUDA and associated libraries (cuDNN, cuBLAS, etc.).
Using a source build made specifically for the target machine with the latest software customizations instead of a build that must be compatible with different hardware and software environments.
Using the latest version of PyTorch and subsidiary libraries. Many users do not update their PyTorch version because of compatibility issues with their pre-existing environment.
Informing users on where to look for solutions to their speed problems (this may be the most important factor).

Combined with techniques such as AMP and cuDNN benchmarking, computational throughput can be increased dramatically (possibly x10) on the same hardware.

Even if you do not wish to use Docker in your project, you may still find this template useful.

The wheel files generated by the build can be used in any Python environment with no dependency on Docker.

This project can thus be used to generate custom wheel files, improving both training and inference speeds dramatically for any desired environment (conda, pip, etc.).

Quickstart

Users are free to customize the train stage of the Dockerfile as they please. However, do not change the build stages unless absolutely necessary. If a new package must be built, add a new build layer.

This project is a template, and users are expected to customize it to fit their needs.

The code is assumed to be running on a Linux host with the necessary NVIDIA Drivers and a recent version of Docker & Docker Compose pre-installed. If this is not the case, install these first.

To build a training image, first edit the Dockerfile train stage to include desired packages from apt/conda/pip.

Then, visit https://developer.nvidia.com/cuda-gpus to find the Compute Capability (CC) of the target GPU device.

Finally, run make all CC=TARGET_CC(s).

Examples

(1) make all CC="8.6" for RTX 3090, (2) make all CC="7.5 8.6" for both RTX 2080Ti and RTX 3090 (building for many GPU CCs will increase build time).

This will result in an image, pytorch_source:train, which can be used for training. Note that CCs for devices not available during the build can be used to build the image. For example, if the image must be used on an RTX 2080Ti machine but the user only has an RTX 3090, the user can set CC="7.5" to enable the image to operate on the RTX 2080Ti GPU. See https://pytorch.org/docs/stable/cpp_extension.html for an in-depth guide on how to set TORCH_CUDA_ARCH_LIST, which is specified by CC in the Makefile.
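
Because a mistyped CC only surfaces after a long build, it may help to validate the value first. The sketch below is a minimal illustration: normalize_cc and make_command are hypothetical helpers, and the accepted format (space- or semicolon-separated major.minor entries with an optional +PTX suffix) is an assumption based on the TORCH_CUDA_ARCH_LIST examples in the PyTorch documentation linked above.

```python
import re
from typing import List

# One entry such as "8.6" or "8.6+PTX" (assumed format, see lead-in).
_CC_PATTERN = re.compile(r"^\d+\.\d(\+PTX)?$")


def normalize_cc(cc: str) -> List[str]:
    """Split a CC string into validated entries, dropping duplicates."""
    entries = [entry for entry in re.split(r"[ ;]+", cc.strip()) if entry]
    if not entries:
        raise ValueError("at least one Compute Capability is required")
    for entry in entries:
        if not _CC_PATTERN.match(entry):
            raise ValueError(f"invalid Compute Capability: {entry!r}")
    # dict.fromkeys removes duplicates while preserving the original order.
    return list(dict.fromkeys(entries))


def make_command(cc: str) -> List[str]:
    """Build the argument list for `make all CC=...` without running it."""
    return ["make", "all", "CC=" + " ".join(normalize_cc(cc))]
```

For example, make_command("7.5 8.6") yields an argument list that can be handed to subprocess.run, while a typo such as "86" raises ValueError before any build starts.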

Makefile Explanation

The Makefile is designed to make using this package simple and modular.

The first image to be created is pytorch_source:build_install, which contains all packages necessary for the build. The installation image is created separately to cache downloads.

The second image is pytorch_source:build_torch-v1.9.1 (by default), which contains the wheels for PyTorch, TorchVision, TorchText, and TorchAudio with settings for PyTorch 1.9.1 on Ubuntu 20.04 LTS with Python 3.8, CUDA 11.3.1 and cuDNN 8. The second image exists to cache the results of the build process.

If you do not wish to use Docker and would like to only extract the .whl wheel files for a pip install on your environment, the generated wheel files can be found in the /tmp/dist directory.
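
One possible way to copy the wheels out of a build image is the docker create, docker cp, docker rm sequence, sketched below. It assumes that the wheels are in /tmp/dist of the pytorch_source:build_torch-v1.9.1 image. The helper names, the container name, and the ./wheels destination are hypothetical.

```python
import subprocess
from typing import List


def extraction_commands(image: str, container: str = "wheel_extract",
                        src: str = "/tmp/dist",
                        dest: str = "./wheels") -> List[List[str]]:
    """Return the docker commands that copy `src` out of `image`."""
    return [
        # Create (but never start) a container; `true` is only a placeholder.
        ["docker", "create", "--name", container, image, "true"],
        # Copy the wheel directory from the stopped container to the host.
        ["docker", "cp", f"{container}:{src}", dest],
        # Remove the temporary container again.
        ["docker", "rm", container],
    ]


def extract_wheels(image: str, **kwargs: str) -> None:
    """Run the extraction commands in order, stopping at the first failure."""
    for command in extraction_commands(image, **kwargs):
        subprocess.run(command, check=True)
```

After extract_wheels("pytorch_source:build_torch-v1.9.1") has finished, the .whl files in the destination directory can be installed with pip in an environment whose Python and CUDA versions match those of the build.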

Saving the build results also allows for more convenient version switching in case different PyTorch versions (different CUDA version, different library version, etc.) are needed.

The final image is pytorch_source:train, which is the image to be used for actual training. It relies on the previous stages only for the build artifacts (wheels, etc.) and nothing else. This makes it very simple to create different training images optimized for different environments and GPU devices.

Because PyTorch has already been built, the training image only needs to download the remaining apt/conda/pip packages. Caching is also implemented to speed up even this process.

Timezone Settings

International users may find this section helpful.

The train image has its timezone set by the TZ variable using the tzdata package. The default timezone is Asia/Seoul but this can be changed by specifying the TZ variable when calling make. Use IANA timezone names to specify the desired timezone.

Example: make all CC="8.6" TZ=America/Los_Angeles uses L.A. time on the training image.

N.B. Only the training image has timezone settings. The installation and build images do not use timezone information.

In addition, the training image has apt and pip installation URLs updated for Korean users. If you wish to speed up your installs, please find URLs optimized for your location, though the installation caches may make this unnecessary.

Specific PyTorch Version

PyTorch subsidiary libraries only work with matching versions of PyTorch.

To change the version of PyTorch, set the PYTORCH_VERSION_TAG, TORCHVISION_VERSION_TAG, TORCHTEXT_VERSION_TAG, and TORCHAUDIO_VERSION_TAG variables to matching versions.

The *_TAG variables must be GitHub tags or branch names of those repositories. Visit the GitHub repositories of each library to find the appropriate tags.

Example: To build on an RTX 3090 GPU with PyTorch 1.9.1, use the following command:

make all CC="8.6" PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1

The resulting image, pytorch_source:train, can be used for training with PyTorch 1.9.1 on GPUs with Compute Capability 8.6.

Multiple Training Images

To use multiple training images on the same host, give a different name to TRAIN_NAME, which has a default value of train.

New training images can be created without having to rebuild PyTorch if the same build image is used for different training images. Creating new training images takes only a few minutes at most.

This is useful for the following use cases.

Allowing different users, who have different UID/GIDs, to use separate training images.
Using different versions of the final training image with different library installations and configurations.
Using this template for multiple PyTorch projects, each with different libraries and settings.

For example, if pytorch_source:build_torch-v1.9.1 has already been built, Alice and Bob would use the following commands to create separate images.

Alice: make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_alice

Bob: make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_bob

This way, Alice's image would have her UID/GID while Bob's image would have his UID/GID. This procedure is necessary because training images have their users set during the build. Also, different users may install different libraries in their training images. Their environment variables and other settings may also be different.
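
Since the two commands differ only in TRAIN_NAME, they can be generated from a shared set of variables, which reduces the risk of a typo in one of the version tags. The following is a minimal sketch: build_train_command is a hypothetical helper, and the values are copied from the example above.

```python
from typing import Dict, List

# Version tags from the example above (PyTorch 1.9.1 and matching libraries).
VERSION_TAGS: Dict[str, str] = {
    "PYTORCH_VERSION_TAG": "v1.9.1",
    "TORCHVISION_VERSION_TAG": "v0.10.1",
    "TORCHTEXT_VERSION_TAG": "v0.10.1",
    "TORCHAUDIO_VERSION_TAG": "v0.9.1",
}


def build_train_command(user: str, cc: str = "8.6",
                        torch_name: str = "build_torch-v1.9.1") -> List[str]:
    """Build the `make build-train` argument list for one user."""
    variables = {
        "CC": cc,
        "TORCH_NAME": torch_name,
        **VERSION_TAGS,
        # Only this variable differs between users.
        "TRAIN_NAME": f"train_{user}",
    }
    return ["make", "build-train"] + [f"{k}={v}" for k, v in variables.items()]
```

Each user would still run the resulting command from their own account, because the UID/GID baked into the training image comes from the user who performs the build.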

Word of Caution

When using build images such as pytorch_source:build_torch-v1.9.1 as a build cache for creating new training images, the user must re-specify all build arguments (variables declared with ARG and passed using --build-arg, including those used to set ENV values) of all previous layers.

Otherwise, the default values for these arguments will be given to the Dockerfile and a cache miss will occur because of the different input values.

This will both waste time rebuilding previous layers and, more importantly, cause inconsistency in the training images due to environment mismatch.

This includes the docker-compose.yaml file as well. All arguments given to the Dockerfile during the build must be respecified. This includes default values present in the Makefile but not present in the Dockerfile such as the version tags.

If Docker starts to rebuild layers that you have already built, suspect that build arguments have been given incorrectly.

See https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache for more information.

Users must set BUILDKIT_INLINE_CACHE=1 during the image build to use it as a cache later. See https://docs.docker.com/engine/reference/commandline/build/#specifying-external-cache-sources for more information.
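
When a cache miss is suspected, comparing the build arguments of the original build with those of the new build is a quick first check. The sketch below is a generic illustration with a hypothetical helper, diff_build_args. How the two dictionaries are collected (from the Makefile, the .env file, or the docker-compose.yaml file) is left to the user.

```python
from typing import Dict, Tuple

_UNSET = "<unset>"  # Placeholder for an argument that was not passed at all.


def diff_build_args(previous: Dict[str, str],
                    current: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {name: (previous_value, current_value)} for differing arguments."""
    names = sorted(set(previous) | set(current))
    return {
        name: (previous.get(name, _UNSET), current.get(name, _UNSET))
        for name in names
        if previous.get(name, _UNSET) != current.get(name, _UNSET)
    }
```

An argument reported as unset on one side falls back to the default in the Dockerfile, so it only causes a cache miss if that default differs from the value used before. An empty result means the arguments match and the cause lies elsewhere.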

Advanced Usage

The Makefile provides the *-full commands for advanced usage.

make all-full CC=YOUR_GPU_CC TRAIN_NAME=train_cu102 will create pytorch_source:build_install-ubuntu18.04-cuda10.2-cudnn8-py3.9, pytorch_source:build_torch-v1.9.1-ubuntu18.04-cuda10.2-cudnn8-py3.9, and pytorch_source:train_cu102 by default.

These images can be used for training/deployment on CUDA 10 devices such as the GTX 1080Ti.

Also, the *-clean commands are provided to check for cache reliance on previous builds.

Specific CUDA Version

Set CUDA_VERSION, CUDNN_VERSION, and MAGMA_VERSION to change CUDA versions. PYTHON_VERSION may also be changed if necessary.

This will create a build image that can be used as a cache to create training images with the build-train command.

Also, the extensive use of caching in the project means that the second build is much faster than the first build. This may be advantageous if many images must be created for multiple PyTorch/CUDA versions.

Specific Linux Distro

CentOS and UBI images can be created with only minor edits to the Dockerfile. Read the Dockerfile for full instructions.

Set the LINUX_DISTRO and DISTRO_VERSION arguments afterwards.

Windows

Windows users may use this template by updating to Windows 11 and installing Windows Subsystem for Linux (WSL). WSL on Windows 11 gives a similar experience to using native Linux.

This project has been tested on Windows 11 WSL with the WSL CUDA driver and Docker Desktop for Windows.

Interactive Development & MLOps with Docker Compose

Raison D'être

The purpose of this section is to introduce a new paradigm for deep learning development. I hope that, eventually, using Docker Compose for deep learning projects will become best practice.

Developing in local environments with conda or pip is commonplace in the deep learning community. However, this risks making the development environment, and the code meant to run on it, unreproducible. This is a serious detriment to scientific progress that many readers of this article will have experienced at first hand.

Docker containers are the standard method for providing reproducible programs across different computing environments. They create isolated environments where programs can run without interference from the host or from one another. See https://www.docker.com/resources/what-container for details.

But in practice, Docker containers are often misused. Containers are meant to be transient. Best practice dictates that a new container be created for each run. This, however, is very inconvenient for development, especially for deep learning applications, where new libraries must constantly be installed and bugs are often only evident at runtime. This leads many researchers to develop inside interactive containers. Docker users often have run.sh files with commands such as docker run -it -v my_data:/mnt/data -p 8080:22 --name my_container my_image:latest /bin/bash (look familiar, anyone?) and use SSH to connect to running containers. VSCode also provides a remote development mode to code inside containers.

The problem with this approach is that these interactive containers become just as unreproducible as local development environments. A running container cannot connect to a new port or attach a new volume. But if the computing environment within the container was created over several months of installs and builds, the only way to keep it is to save it as an image and create a new container from the saved image. After a few iterations of this process, the resulting image becomes bloated and just as unreproducible as the local environment that it was meant to replace.

Problems become even more evident when preparing for deployment. MLOps, defined as a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently, has gained enormous popularity of late as many practitioners have come to realize the importance of continuously maintaining ML systems long after the initial development phase ends.

However, bad practices such as those mentioned above mean that much coffee is spilled turning research code into something that is production-ready. Often, even the original developers cannot retrain the same model after a few months. Many firms thus have entire teams dedicated to model translation, a huge expenditure.

To alleviate these problems, I propose the use of Docker Compose as a basic MLOps solution for both development and production. Using Docker and Docker Compose, the entire training environment can be reproduced. Compose has not yet caught on in the deep learning community, possibly because it is usually advertised as a multi-container solution, though it can be used for single-container development just as well.

A docker-compose.yaml file is provided for easy management of containers. Using the provided docker-compose.yaml file will create an interactive environment, providing a programming experience very similar to using a terminal on a remote server. Integrations with popular IDEs (PyCharm, VSCode) are also available. Moreover, it also allows the user to specify settings for both build and run, removing the need to manage the environment with custom shell scripts. Connecting a new volume is as simple as removing the current container, adding a line in the docker-compose.yaml/Dockerfile file, then creating a new container from the same image. Build caches allow new images to be built very quickly, removing another barrier to Docker adoption, the long initial build time.

Docker Compose can also be used directly for deployment with swarm mode, which is an excellent solution for small-scale deployments (one physical server with up to 8 GPUs). See https://docs.docker.com/engine/swarm for documentation. Though less capable than Kubernetes, swarm mode has a much gentler learning curve, requiring less experienced (read expensive) engineers to utilize. Also, at the risk of deflating some egos, I wish to point out that the vast majority of services never go "planet scale", whatever the CEO has been pitching to investors. Even if large-scale deployments do become necessary, using Docker from the very beginning will accelerate the development process and make MLOps adoption much simpler. Accelerating time-to-market by streamlining the development process is a competitive edge for any firm, whether lean startup or tech titan.

With luck, the deep learning community will be able to "code once, train anywhere" with the technique I propose here. But even if I fail in persuading the majority of users of the merits of my method, I may still spare many a hapless grad student from the Sisyphean labor of setting up their conda environment, only to have it crash and burn right before their paper submission is due.

Usage

Docker images created by the Makefile are fully compatible with the docker-compose.yaml file. There is no need to erase them to use Docker Compose.

Using Docker Compose V2 (see https://docs.docker.com/compose/cli-command), follow the steps below: a one-time setup step followed by two commands. Here, train is the default service name in the provided docker-compose.yaml file.

Read docker-compose.yaml and set variables in the .env file (first time only).
docker compose up -d train
docker compose exec train /bin/bash

This will open an interactive shell with settings specified by the train service in the docker-compose.yaml file. Environment variables can be saved in a .env file placed on the project root, allowing different projects and different users to set their own variables as required. To create a basic .env file with the UID and GID, run make env.

Example .env file for RTX 3090 GPUs:

UID=1000
GID=1000
CC=8.6
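
For readers who prefer to generate this file programmatically, the following sketch writes the same three variables. It is not necessarily what make env does; render_env and write_env are hypothetical helpers, and os.getuid/os.getgid are only available on Unix-like hosts such as Linux and WSL.

```python
import os
from pathlib import Path
from typing import Optional


def render_env(cc: str, uid: Optional[int] = None,
               gid: Optional[int] = None) -> str:
    """Render .env contents, defaulting to the current user's UID and GID."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return f"UID={uid}\nGID={gid}\nCC={cc}\n"


def write_env(cc: str, path: str = ".env") -> None:
    """Write the rendered variables to `path`, overwriting any existing file."""
    Path(path).write_text(render_env(cc))
```

Calling write_env("8.6") in the project root as the user who will run the containers produces a file equivalent to the example above, with that user's own UID and GID.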

The Compose workflow is extremely convenient for managing reproducible development environments. For example, if a new pip or apt package must be installed for the project, users can simply edit the train layer of the Dockerfile by adding the package to the apt-get install or pip install commands, then run the following command:

docker compose up -d --build train

This will remove the current train container, rebuild the image, and start a new train container. It will not, however, rebuild PyTorch (assuming no cache miss occurs). Users thus need only wait a few minutes for the additional downloads, which are accelerated by caching and fast mirror URLs.

To stop and restart a service after editing the Dockerfile or docker-compose.yaml file, simply run docker compose up -d --build train again.

To stop services and remove containers, use the following command:

docker compose down

Users with remote servers may use Docker contexts (see https://docs.docker.com/engine/context/working-with-contexts) to access their containers from their local environments. For more information on Docker Compose, see the documentation https://github.com/compose-spec/compose-spec/blob/master/spec.md.

Initial Setup in Detail

If this is your first time using this project, follow these steps:

Install Docker Compose V2 for Linux as described in https://docs.docker.com/compose/cli-command/#install-on-linux. Installation does not require root permissions. Check the version and architecture tags in the URL before installing. The following commands will install Docker Compose V2 (v2.1.0, Linux x86_64) for a single user.

mkdir -p ~/.docker/cli-plugins/
curl -SL https://github.com/docker/compose/releases/download/v2.1.0/docker-compose-linux-x86_64 -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose

The instructions above are for Linux hosts. WSL users should instead enable "Use Docker Compose V2" on Docker Desktop for Windows.

Run make env on the terminal to create a basic .env file. Then read the docker-compose.yaml file to fill in extra variables. Also edit docker-compose.yaml as necessary for your project.
Run docker compose up -d --build train or docker compose up -d --build full. The train service corresponds to the default make all ... build while the full service corresponds to the make all-full ... build. If you have already run make all ... or make all-full ..., check that the docker-compose.yaml file has the same configurations as the make command used to create the Docker images. Otherwise, a cache miss will occur, rebuilding the image with the new settings.
After docker compose up -d --build SERVICE_NAME has finished, and if you have not yet run make all(-full) ..., run the corresponding make command with the same settings as in the docker-compose.yaml and .env files. This will save the build cache as images, preventing it from being cleared by the system later on. If no cache miss occurs, this will take only a few minutes.
Run docker compose exec SERVICE_NAME /bin/bash and start coding.

Compose as Best Practice

I emphasize that using Docker Compose like this is a general-purpose technique that does not depend on anything about this project. As an example, an image from the NVIDIA NGC PyTorch repository has been used as the base image in ngc.Dockerfile. The NVIDIA NGC PyTorch images contain many optimizations for the latest GPU architectures and provide a multitude of pre-installed machine learning libraries. For anyone starting a new project, and therefore with no dependencies, using the latest NGC image is recommended.

To use the NGC images, use the following commands:

docker compose up -d ngc
docker compose exec ngc /bin/bash

The only difference from the previous example is the service name.

Using Compose with PyCharm and VSCode

The Docker Compose container environment can be used with popular Python IDEs, not just in the terminal. PyCharm and Visual Studio Code, both very popular in the deep learning community, are compatible with Docker Compose.

If you are using a remote server, first create a Docker context to connect your local Docker with the remote Docker.
PyCharm (Professional only): Docker Compose is available natively as a Python interpreter. See tutorial for details. Note that PyCharm Professional is available free of charge to anyone with a valid university e-mail address.
VSCode: Install the Remote Development extension pack. See tutorial for details.

Known Issues

Connecting to a running container by ssh will remove all variables set by ENV. This is because sshd starts a new environment, wiping out all previous variables. Using docker/docker compose to enter containers is strongly recommended.
Building on CUDA 11.4.x or greater is not available as of November 2021 because magma-cuda114 has not been released on the pytorch channel of anaconda. Users may attempt building with older versions of magma-cuda or try the version available on conda-forge. A source build of magma would be welcome as a pull request. Note that the NVIDIA NGC images use an in-house build of magma.
Ubuntu 16.04 build fails. This is because the default git installed by apt on Ubuntu 16.04 does not support the --jobs flag. Add the git-core PPA to apt and install the latest version of git. Also, PyTorch v1.9+ will not build on Ubuntu 16. Lower the version tag to v1.8.2 to build. However, the project will not be modified to accommodate Ubuntu 16.04 builds as Xenial Xerus has already reached EOL.
Docker Compose does not run on WSL. Disable the ipc: host option in the docker-compose.yaml file, as WSL cannot use this option.
torch.cuda.is_available() returns ... UserWarning: CUDA initialization:... error or the image will simply not start. This indicates that the CUDA driver on the host is incompatible with the CUDA version on the Docker image. Either upgrade the host CUDA driver or downgrade the CUDA version of the image. Check the compatibility matrix to see if the host CUDA driver is compatible with the desired version of CUDA.

Desiderata

MORE STARS. If you are reading this, star this repository immediately. I'm serious.
CentOS and UBI images have not been implemented yet. As they require only simple modifications, pull requests implementing them would be very much welcome.
Translations into other languages are welcome. Please make a separate LANG.README.md file and create a PR.
A method to build magma from source would be greatly appreciated. Although the code for building the magma package is available at https://github.com/pytorch/builder/tree/main/magma, it is only updated several months after a new CUDA version is released. A source build as a layer on the image would be welcome.
Please feel free to share this project! I wish you good luck and happy coding!

ThanThoai / PyTorch-Universal-Docker-Template