MARD1NO / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Home Page:

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

License MIT PyPI version Downloads Build

Latest News

DeepSpeed trained the world's most powerful language models (MT-530B, BLOOM); learn how.

Extreme Speed and Scale for DL Training and Inference

DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:

  • Train/Inference dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train/Inference on resource constrained GPU systems
  • Achieve unprecedented low latency and high thoughput for inference
  • Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs

DeepSpeed's three innovation pillars


DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. These innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, etc. fall under the training pillar. Learn more: DeepSpeed-Training


DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, thoughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: DeepSpeed-Inference


To further increase the inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the compression pillar. Learn more: DeepSpeed-Compression

DeepSpeed Software Suite

DeepSpeed Library

The DeepSpeed library (this repository) implements and packages the innovations and technologies in DeepSpeed Training, Inference and Compression Pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of multitude of features within a single training, infernece or compression pipeline. The DeepSpeed Library is heavily adopted by the DL community, and has been used to enable some of the most powerful models (see DeepSpeed Adoption).

Model Implementations for Inference (MII)

Model Implementations for Inference (MII) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out-of-box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code, while achieving significant latency reduction compared to their vanilla open-sourced versions.

DeepSpeed on Azure

DeepSpeed users are diverse and have access to different environments. We recommend to try DeepSpeed on Azure as it is the simplest and easiest method. The recommended method to try DeepSpeed on Azure is through AzureML recipes. The job submission and data preparation scripts have been made available here. For more details on how to use DeepSpeed on Azure, please follow the Azure tutorial.

DeepSpeed Adoption

DeepSpeed is an important part of Microsoft’s new AI at Scale initiative to enable next-generation AI capabilities at scale, where you can find more information here.

DeepSpeed has been used to train many different large-scale models, below is a list of several examples that we are aware of (if you'd like to include your model please submit a PR):

DeepSpeed has been integrated with several different popular open-source DL frameworks such as:

Transformers with DeepSpeed
Accelerate with DeepSpeed
Lightning with DeepSpeed
MosaicML with DeepSpeed

Build Pipeline Status

Description Status
NVIDIA nv-torch12-p40 nv-torch18-v100 nv-torch-latest-v100 nv-inference
AMD amd
PyTorch Nightly nv-torch-nightly-v100
Integrations nv-transformers-v100 nv-lightning-v100 nv-accelerate-v100
Misc Formatting pages-build-deployment Documentation Status


The quickest way to get started with DeepSpeed is via pip, this will install the latest release of DeepSpeed which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops will be built just-in-time (JIT) using torch's JIT C++ extension loader that relies on ninja to build and dynamically link them at runtime.


  • PyTorch must be installed before installing DeepSpeed.
  • For full feature support we recommend a version of PyTorch that is >= 1.8 and ideally the latest PyTorch stable release.
  • Specific GPUs we develop and test against are listed below, this doesn't mean your GPU will not work if it doesn't fall into this category it's just DeepSpeed is most well tested on the following:
    • NVIDIA: Pascal, Volta, and Ampere architectures
    • AMD: MI100 and MI200


We regularly push releases to PyPI and encourage users to install from there in most cases.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.


If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling) or install pre-compiled ops via PyPI please see our advanced installation instructions.


Windows support is partially supported with DeepSpeed. On Windows you can build wheel with following steps, currently only inference mode is supported.

  1. Install pytorch, such as pytorch 1.8 + cuda 11.1
  2. Install visual cpp build tools, such as VS2019 C++ x64/x86 build tools
  3. Launch cmd console with Administrator privilege for creating required symlink folders
  4. Run python bdist_wheel to build wheel in dist folder


Please checkout DeepSpeed-Training, DeepSpeed-Inference and DeepSpeed-Compression pages for full set of features offered along each of these three pillars.

Further Reading

All DeepSpeed documentation, tutorials, and blogs can be found on our website:

Getting Started First steps with DeepSpeed
DeepSpeed JSON Configuration Configuring DeepSpeed
API Documentation Generated DeepSpeed API documentation
Tutorials Tutorials
Blogs Blogs


DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact with any additional questions or comments.


  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840.
  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888 and ICML 2021.
  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857.
  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069.
  8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. arXiv:2108.06084.
  9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.
  10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596.
  11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.
  12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859.
  13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861.
  14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.


  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Community Tutorials


DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

License:MIT License


Language:Python 80.6%Language:Cuda 10.1%Language:C++ 7.7%Language:Shell 0.7%Language:C 0.6%Language:Dockerfile 0.2%