FTPipe

This repository contains the code used for the FTPipe USENIX ATC '21 paper, "Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism", and for follow-up work.

See citation information at the bottom of this readme.

Overview

This repository was used to explore previously unexplored aspects of pipeline model parallelism. It can automatically partition, train, and fine-tune giant neural networks with both synchronous and asynchronous pipelines.
Code for the pipeline staleness mitigation study is included as well.

Supported and tested models include Huggingface transformers (T5, GPT2, BERT, RoBERTa...), many Torchvision models (probably all), and Vision Transformers. (We ran an out-of-the-box ViT proof of concept with the first PyTorch implementation, from timm, as soon as it appeared.)
The setup for T5-11B is currently kept on a separate branch.

Basic Usage

Clone the repository:

git clone https://github.com/saareliad/FTPipe.git

All code is currently designed to run from the repository root.

After completing the environment setup, using FTPipe mainly involves two steps:

Partitioning models.
Running models.

python -m autopipe.partition ... # partition models

python -m pipe.main ... # train models (+eval)

Additional documentation:

Training arguments should be passed via JSON configuration files (*).
New models, training/fine-tuning tasks, and datasets should be registered with the framework.
Additional arguments are passed as command-line arguments. Use the --help option to explore them. (NOTE: some configuration arguments can also be overridden from the command line; use this with caution. Partitioning uses mostly command-line arguments.)
As P2P communication is done with MPI, running models often looks like this:

mpirun -np 8 python -m pipe.main --config $PATH_TO_JSON_CONFIG

Refer to the examples of recent scripts we used to partition models and conduct T5 experiments.
Feel free to contact us (issue/mail/LinkedIn/...).

(*Note: a more comprehensive explanation is planned. Meanwhile, the configuration can be understood from the examples or the code.)
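The split between a JSON configuration file and command-line overrides can be pictured with the minimal sketch below. It is not FTPipe's actual implementation: the `--seed` and `--epochs` flags, the configuration keys, and the helper names are hypothetical and chosen only for illustration. Only the `--config` option appears in the command shown above.

```python
import argparse
import json


def load_config(config_path, overrides=None):
    """Read a JSON config file and apply explicitly given overrides on top."""
    with open(config_path) as f:
        config = json.load(f)
    for key, value in (overrides or {}).items():
        # A value of None means the flag was not given, so the file value is kept.
        if value is not None:
            config[key] = value
    return config


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical launcher sketch")
    parser.add_argument("--config", required=True, help="path to a JSON config file")
    # Hypothetical override flags; they default to None so that flags
    # that are not passed never overwrite values from the JSON file.
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    overrides = {k: v for k, v in vars(args).items() if k != "config"}
    return load_config(args.config, overrides)
```

Calling `main(["--config", "cfg.json", "--seed", "42"])` returns the contents of `cfg.json` with `seed` replaced by 42. Keeping the file as the default source and the command line as an explicit override makes it easy to see which values deviate from a recorded experiment, which is one reason to use overrides with caution.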

Setup

Follow the instructions to set up the required conda env. This includes building PyTorch from source with CUDA-aware Open MPI.
NOTE: Model partitioning can be done using a much simpler conda env (without MPI or building from source):

conda env create -f pipe/env_utils/env_without_mpi.yml

The simple recipe below was used to set it up on our servers:

BUILD_DIR=<SOMEPLACE_FOR_DOWNLOADED_SOFTWARE> # openmpi, pytorch
cd pipe/env_utils
cp create_env_new_server.sh $BUILD_DIR
cd $BUILD_DIR
vim create_env_new_server.sh  # change paths: home_local, FTPIPE_ROOT
bash create_env_new_server.sh # it is safer to run it step by step.

where $BUILD_DIR is set to a directory in which to place the clones of openmpi and pytorch.
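Once the environment is built, a quick sanity check such as the sketch below can confirm that the installed PyTorch exposes the MPI backend. It relies on `torch.distributed.is_available()` and `torch.distributed.is_mpi_available()`; the function name and the returned status strings are illustrative assumptions and not part of FTPipe. It does not verify that the underlying MPI library is CUDA-aware.

```python
import importlib.util


def mpi_backend_status():
    """Return a short status string describing MPI support in the local PyTorch."""
    # Avoid an ImportError when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch.distributed as dist

    # The distributed package itself can be compiled out of some builds.
    if not dist.is_available():
        return "distributed-unavailable"
    return "mpi-available" if dist.is_mpi_available() else "mpi-unavailable"


print(mpi_backend_status())
```

In the simpler partitioning-only environment, a result other than `mpi-available` is expected and harmless, since partitioning does not need MPI.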

Additional docs

Work is in progress to move all docs into their own docs directory.

Some additional usage instructions are documented across the repository. For example:

In the pipe module, there are instructions and scripts for downloading data.
Refer to the pipes-list for the available staleness mitigation methods and pipelines that can be used at runtime.
See the autopipe module for the available partitioning methods. See the tasks directory for examples of partitioning tasks (e.g., different model architectures or downstream fine-tuning tasks).
A detailed example of the steps and changes needed to export a T5 model from Huggingface can be found here.
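Registering a new model, task, or dataset with a framework is commonly done through a name-to-factory registry. The sketch below illustrates that general pattern only: `TASK_REGISTRY`, `register_task`, `get_task`, and the task name are hypothetical and do not reflect FTPipe's actual registration API, for which the tasks directory remains the reference.

```python
# Hypothetical registry mapping a task name to a factory function.
TASK_REGISTRY = {}


def register_task(name):
    """Decorator that records a factory under `name`, rejecting duplicates."""
    def decorator(factory):
        if name in TASK_REGISTRY:
            raise ValueError(f"task {name!r} is already registered")
        TASK_REGISTRY[name] = factory
        return factory
    return decorator


@register_task("toy_classification")
def build_toy_task(num_labels=2):
    # A real factory would build a model and sample inputs; this returns a stub.
    return {"task": "toy_classification", "num_labels": num_labels}


def get_task(name, **kwargs):
    """Look up a registered factory by name and call it with `kwargs`."""
    try:
        factory = TASK_REGISTRY[name]
    except KeyError:
        known = sorted(TASK_REGISTRY)
        raise KeyError(f"unknown task {name!r}; registered: {known}") from None
    return factory(**kwargs)
```

With such a registry, a command-line or configuration value like `"toy_classification"` can select the task by name without the launcher importing each task explicitly.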

Note

Note: some hyper-parameters in mpipe partitioning (e.g., GPU memory capacity), the env, and so on are still hardcoded to our setup and are not available as command-line options. Currently, they have to be changed manually in order to experiment (as we did...).

Citation

@inproceedings {ftpipe,
author = {Saar Eliad and Ido Hakimi and Alon De Jagger and Mark Silberstein and Assaf Schuster},
title = {Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism},
booktitle = {2021 {USENIX} Annual Technical Conference ({USENIX} {ATC} 21)},
year = {2021},
isbn = {978-1-939133-23-6},
pages = {381--396},
url = {https://www.usenix.org/conference/atc21/presentation/eliad},
publisher = {{USENIX} Association},
month = jul,
}
