ModelTC / awesome-lm-system

Summary of system papers/frameworks/code/tools on training or serving large models


Awesome Large Model (LM) System

This repo collects papers, repos, and tools for large model systems, covering training, inference, serving, and compression.

Papers

Training

| Year | Publisher | Title | Framework |
|------|-----------|-------|-----------|
| 2023 | | Training Diffusion Models with Reinforcement Learning | |
| 2023 | | Extracting Training Data from Diffusion Models | |
| 2023 | ICLR | DySR: Adaptive Super-Resolution via Algorithm and System Co-design | DeepSpeed |
| 2023 | | Scaling Vision-Language Models with Sparse Mixture of Experts | DeepSpeed |
| 2023 | IPDPS | MCR-DL: Mix-and-Match Communication Runtime for Deep Learning | DeepSpeed |
| 2023 | ICS | A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training | DeepSpeed |
| 2023 | OSDI | AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | Alpa |
| 2023 | MLSys | On Optimizing the Communication of Model Parallelism | Alpa |
| 2023 | | Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models | ColossalAI |
| 2022 | CVPR | Perception Prioritized Training of Diffusion Models | |
| 2022 | | Reducing Activation Recomputation in Large Transformer Models | Megatron-LM |
| 2022 | HiPC | 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed | DeepSpeed |
| 2022 | NeurIPS | The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models | DeepSpeed |
| 2022 | | Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam | DeepSpeed |
| 2022 | ICML | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | DeepSpeed |
| 2022 | | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | DeepSpeed |
| 2022 | | Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers | DeepSpeed |
| 2022 | | DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing | DeepSpeed |
| 2022 | OSDI | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | Alpa |
| 2022 | ICPP | Tesseract: Parallelize the Tensor Parallelism Efficiently | ColossalAI |
| 2022 | | A Frequency-aware Software Cache for Large Recommendation System Embeddings | ColossalAI |
| 2022 | TPDS | Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management | ColossalAI |
| 2021 | | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Megatron-LM |
| 2021 | | LoRA: Low-Rank Adaptation of Large Language Models | |
| 2021 | SC | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | DeepSpeed |
| 2021 | ICML | 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed | DeepSpeed |
| 2021 | ATC | ZeRO-Offload: Democratizing Billion-Scale Model Training | DeepSpeed |
| 2021 | PPoPP | DAPPLE: A Pipelined Data Parallel Approach for Training Large Models | |
| 2021 | ICML | TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | TeraPipe |
| 2021 | ICML | Memory-Efficient Pipeline-Parallel DNN Training | PipeDream |
| 2021 | | An Efficient 2D Method for Training Super-Large Deep Learning Models | ColossalAI |
| 2021 | | Maximizing Parallelism in Distributed Training for Huge Neural Networks | ColossalAI |
| 2021 | | Sequence Parallelism: Long Sequence Training from System Perspective | ColossalAI |
| 2021 | | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training | ColossalAI |
| 2020 | KDD Tutorial | DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | DeepSpeed |
| 2020 | SC | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | DeepSpeed |
| 2020 | NeurIPS | Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping | DeepSpeed |
| 2020 | | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Megatron-LM |
| 2020 | | torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models | TorchGpipe |
| 2019 | NeurIPS | GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism | TorchGpipe |
| 2019 | SOSP | PipeDream: Generalized Pipeline Parallelism for DNN Training | PipeDream |
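Most of the training systems above combine data, tensor, and pipeline parallelism. As a hedged illustration (not any listed framework's actual API), the sketch below simulates Megatron-style tensor parallelism for a two-layer MLP on a single process, with the per-device weight shards represented as plain tensors; all names and sizes are made up for the example.

```python
# Minimal single-process sketch of Megatron-style tensor parallelism.
# Shards are simulated with plain tensors; a real implementation places
# each shard on its own GPU and uses an all-reduce for the final sum.
import torch

torch.manual_seed(0)
tp_degree = 2          # number of simulated tensor-parallel ranks
d_model, d_ff = 8, 16  # toy hidden sizes

x = torch.randn(4, d_model)      # [batch, d_model]
w1 = torch.randn(d_model, d_ff)  # first MLP weight (column-parallel)
w2 = torch.randn(d_ff, d_model)  # second MLP weight (row-parallel)

# Column-parallel: each rank owns a slice of w1's output columns.
w1_shards = torch.chunk(w1, tp_degree, dim=1)
# Row-parallel: each rank owns the matching slice of w2's input rows.
w2_shards = torch.chunk(w2, tp_degree, dim=0)

partial_outputs = []
for rank in range(tp_degree):
    h = torch.relu(x @ w1_shards[rank])          # local activation shard
    partial_outputs.append(h @ w2_shards[rank])  # local partial result

# The "all-reduce" step: summing partial results recovers the full output.
y_parallel = torch.stack(partial_outputs).sum(dim=0)
y_reference = torch.relu(x @ w1) @ w2
print(torch.allclose(y_parallel, y_reference, atol=1e-5))  # True
```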

Compression

| Year | Publisher | Title | Framework |
|------|-----------|-------|-----------|
| 2023 | | Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge | |
| 2023 | | CBQ: Cross-Block Quantization for Large Language Models | |
| 2023 | | Norm Tweaking: High-performance Low-bit Quantization of Large Language Models | |
| 2023 | | Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM | |
| 2023 | | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | |
| 2023 | | RPTQ: Reorder-based Post-training Quantization for Large Language Models | |
| 2023 | | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | |
| 2023 | | LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning | |
| 2023 | | QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | |
| 2023 | | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | |
| 2023 | | AffineQuant: Affine Transformation Quantization for Large Language Models | |
| 2023 | | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | |
| 2023 | | QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models | |
| 2023 | | LLM-Pruner: On the Structural Pruning of Large Language Models | |
| 2023 | | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | |
| 2023 | | SqueezeLLM: Dense-and-Sparse Quantization | |
| 2023 | | A Simple and Effective Pruning Approach for Large Language Models | |
| 2023 | | On Architectural Compression of Text-to-Image Diffusion Models | |
| 2023 | ICML | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | |
| 2023 | | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | |
| 2023 | | OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models | |
| 2023 | ICLR | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | |
| 2023 | ISCA | OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization | |
| 2023 | | Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | |
| 2023 | | ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | |
| 2023 | ICML | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | |
| 2023 | ICML | Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases | DeepSpeed |
| 2023 | | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | |
| 2023 | | QLoRA: Efficient Finetuning of Quantized LLMs | |
| 2022 | NeurIPS | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | DeepSpeed |
| 2022 | NeurIPS | Extreme Compression for Pre-trained Transformers Made Simple and Efficient | DeepSpeed |
| 2022 | NeurIPS | Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models | |
| 2022 | NeurIPS | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | |
| 2021 | EMNLP | Understanding and Overcoming the Challenges of Efficient Transformer Quantization | |
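The post-training quantization papers above (GPTQ, AWQ, SmoothQuant, OmniQuant, and others) all start from, and improve on, the round-to-nearest baseline. Below is a minimal sketch of that baseline as per-channel symmetric int8 weight quantization; the function names and settings are illustrative and not taken from any listed work.

```python
# Per-channel symmetric round-to-nearest (RTN) weight quantization,
# the baseline most of the PTQ papers above improve on. Illustrative only.
import torch

def quantize_rtn(weight: torch.Tensor, n_bits: int = 8):
    """Quantize a [out_features, in_features] weight per output channel."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 127 for int8
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                     # avoid division by zero
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, scale = quantize_rtn(w, n_bits=8)
    w_hat = dequantize(q, scale)
    print("mean abs error:", (w - w_hat).abs().mean().item())
```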

Inference

| Year | Publisher | Title | Framework |
|------|-----------|-------|-----------|
| 2023 | | Fast Inference in Denoising Diffusion Models via MMD Finetuning | |
| 2023 | | EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models | EnergonAI |
| 2023 | | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | |
| 2023 | | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | |
| 2022 | ICML | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | DeepSpeed |
| 2022 | SC | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | DeepSpeed |
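Much of the inference work above (H2O, FlexGen, DeepSpeed Inference) centers on managing the attention key/value cache during autoregressive decoding. The snippet below is a minimal, framework-agnostic sketch of that cache for a toy single-head attention layer; it shows the data structure, not any listed system's implementation.

```python
# Minimal sketch of KV caching for autoregressive attention decoding.
# Toy single-head attention; real systems additionally offload, evict,
# or compress these cached tensors.
import torch

d = 16
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []   # grows by one entry per generated token

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """Attend the new token x_t ([1, d]) over all cached keys/values."""
    k_cache.append(x_t @ wk)
    v_cache.append(x_t @ wv)
    K = torch.cat(k_cache, dim=0)                  # [t, d]
    V = torch.cat(v_cache, dim=0)                  # [t, d]
    q = x_t @ wq                                   # [1, d]
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                                # [1, d]

for _ in range(5):                                 # 5 dummy decode steps
    out = decode_step(torch.randn(1, d))
print(len(k_cache), out.shape)                     # 5 torch.Size([1, 16])
```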

Benchmark

| Year | Publisher | Title | Framework |
|------|-----------|-------|-----------|

Survey

| Year | Publisher | Title | Framework |
|------|-----------|-------|-----------|

Frameworks

| Year | Name | Training | Inference | Serving | Comments |
|------|------|----------|-----------|---------|----------|
| 2023 | EnergonAI | | | | |
| 2022 | Alpa | | | | Compilation-based mixed parallelism |
| 2021 | Megatron-DeepSpeed | | | | Adds MoE model training, curriculum learning, and 3D parallelism from DeepSpeed to Megatron |
| 2021 | TeraPipe | | | | |
| 2021 | ColossalAI | | | | |
| 2021 | FasterTransformer | | | | |
| 2020 | DeepSpeed | | | | General support of Transformers and MoE with 3D parallelism |
| 2019 | Megatron-LM | | | | |
| 2019 | PipeDream | | | | |
| 2019 | TorchGpipe | | | | torchgpipe was merged into PyTorch in 2020 |
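Several of the frameworks above (TorchGpipe, PipeDream, DeepSpeed's pipeline engine) are built around micro-batch pipeline schedules. Below is a single-process sketch of GPipe-style micro-batching with the pipeline stages simulated as ordinary modules; a real engine places each stage on its own device and overlaps micro-batches across stages, which this sketch deliberately omits.

```python
# Single-process sketch of GPipe-style micro-batch pipelining.
# Stages run sequentially here; a real pipeline engine assigns each
# stage to its own device and keeps all stages busy at once.
import torch
import torch.nn as nn

stages = nn.ModuleList([nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8)])

def pipeline_forward(batch: torch.Tensor, n_micro: int = 4) -> torch.Tensor:
    outputs = []
    for micro in torch.chunk(batch, n_micro, dim=0):  # split into micro-batches
        for stage in stages:                          # pass through each "device"
            micro = stage(micro)
        outputs.append(micro)
    return torch.cat(outputs, dim=0)                  # reassemble the full batch

x = torch.randn(16, 32)
y = pipeline_forward(x)
print(y.shape)   # torch.Size([16, 8])
```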


License: Apache License 2.0