Awesome-LLM-Inference

📖A curated list of Awesome LLM Inference Papers with codes: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

Home Page: https://github.com/DefTruth/Awesome-LLM-Inference

📒Introduction

Awesome-LLM-Inference: a curated list of 📙Awesome LLM Inference Papers with Codes. Please check 📖Contents for more details.

©️Citations

@misc{Awesome-LLM-Inference-2023,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/DefTruth/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
  author={Yanjun Qiu},
  year={2023}
}

🎉Download PDFs

@Awesome-LLM-Inference-v0.3.pdf: 500 pages covering FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ, etc.

📙Awesome LLM Inference Papers with Codes

📖Contents

📖LLM Algorithmic/Eval Survey (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2023.10 | [Evaluating] Evaluating Large Language Models: A Comprehensive Survey (@tju.edu.cn) | [pdf] | [Awesome-LLMs-Evaluation] | ⭐️ |
| 2023.11 | 🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models (@hkust-gz.edu.cn) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.11 | [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up? (@e.ntu.edu.sg) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey (@Microsoft) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly (@Drexel University) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | 🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference (@princeton.edu) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.12 | 🔥[Efficient LLMs] Efficient Large Language Models: A Survey (@Ohio State University etc) | [pdf] | [Efficient-LLMs-Survey] | ⭐️⭐️ |
| 2023.12 | [Serving Survey] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (@Carnegie Mellon University) | [pdf] | ⚠️ | ⭐️⭐️ |

📖LLM Train/Inference Framework (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2020.05 | 🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism (@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU (@Stanford University etc) | [pdf] | [FlexGen] | ⭐️ |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification (@Peking University etc) | [pdf] | [FlexFlow] | ⭐️ |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models (@Peking University etc) | [pdf] | ⚠️ | ⭐️ |
| 2023.09 | 🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention (@UC Berkeley etc) | [pdf] | [vllm] | ⭐️⭐️ |
| 2023.09 | [StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS (@Meta AI etc) | [pdf] | [streaming-llm] | ⭐️ |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (@Tianle Cai etc) | [blog] | [Medusa] | ⭐️ |
| 2023.10 | 🔥[TensorRT-LLM] NVIDIA TensorRT LLM (@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
| 2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (@Microsoft) | [blog] | [deepspeed-fastgen] | ⭐️⭐️ |
| 2023.12 | 🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet (@HSE University etc) | [pdf] | [petals] | ⭐️⭐️ |
| 2023.10 | [LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers (@UC Berkeley etc) | [pdf] | [LightSeq] | ⭐️ |
| 2023.12 | [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (@SJTU) | [pdf] | [PowerInfer] | ⭐️ |
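
Most of the frameworks above expose a simple generate-style Python API. As a quick taste, here is a minimal offline-inference sketch against vLLM (assuming vLLM is installed and a GPU is available; the model id is just a placeholder):

```python
from vllm import LLM, SamplingParams

# vLLM's PagedAttention-backed engine; any HF-style causal LM id works here.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea of PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```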

📖Continuous/In-flight Batching (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2022.07 | 🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models (@Seoul National University etc) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.10 | 🔥[In-flight Batching] NVIDIA TensorRT LLM Batch Manager (@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
| 2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (@Microsoft) | [blog] | [deepspeed-fastgen] | ⭐️⭐️ |
| 2023.11 | [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting (@Microsoft etc) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances (@cmu.edu etc) | [pdf] | [SpotServe] | ⭐️ |
| 2023.10 | [LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers (@UC Berkeley etc) | [pdf] | [LightSeq] | ⭐️ |
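
For intuition, continuous (in-flight) batching admits new requests and retires finished ones at every decode step, instead of waiting for a whole static batch to drain — which is where Orca's throughput gains come from. A toy scheduler sketch (the `Request` and `decode_step` types are illustrative assumptions, not any framework's API):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int        # decode steps left until this request finishes
    done: bool = False

def decode_step(batch):
    # stand-in for one forward pass emitting one token per running request
    for r in batch:
        r.remaining -= 1
        r.done = r.remaining <= 0

def serve(pending: deque, max_batch: int = 8):
    running = []
    while pending or running:
        # continuous batching: admit new work every step, not per static batch
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        decode_step(running)
        running = [r for r in running if not r.done]  # retire finished requests

serve(deque(Request(remaining=n) for n in (3, 10, 5)))
```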

📖Weight/Activation Quantize/Compress (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2022.06 | 🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers (@Microsoft) | [pdf] | [DeepSpeed] | ⭐️⭐️ |
| 2022.08 | [FP8-Quantization] FP8 Quantization: The Power of the Exponent (@Qualcomm AI Research) | [pdf] | ⚠️ | ⭐️ |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale (@Facebook AI Research etc) | [pdf] | [bitsandbytes] | ⭐️ |
| 2022.10 | 🔥[GPTQ] GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS (@IST Austria etc) | [pdf] | [gptq] | ⭐️⭐️ |
| 2022.11 | 🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production (@NVIDIA&Microsoft) | [pdf] | [FasterTransformer] | ⭐️⭐️ |
| 2022.11 | 🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models (@MIT etc) | [pdf] | [smoothquant] | ⭐️⭐️ |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation (@Microsoft) | [pdf] | [DeepSpeed] | ⭐️ |
| 2023.06 | 🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (@MIT etc) | [pdf] | [llm-awq] | ⭐️⭐️ |
| 2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (@University of Washington etc) | [pdf] | [SpQR] | ⭐️ |
| 2023.06 | [SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION (@berkeley.edu) | [pdf] | [SqueezeLLM] | ⭐️ |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats (@Microsoft) | [pdf] | [DeepSpeed] | ⭐️ |
| 2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization (@HPC4AI) | [blog] | ⚠️ | ⭐️ |
| 2023.10 | [FP8-LM] FP8-LM: Training FP8 Large Language Models (@Microsoft etc) | [pdf] | [MS-AMP] | ⭐️ |
| 2023.10 | [LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING (@cs.princeton.edu etc) | [pdf] | [LLM-Shearing] | ⭐️ |
| 2023.10 | [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers (@ust.hk&meta etc) | [pdf] | [LLM-FP4] | ⭐️ |
| 2023.11 | [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization (@Shanghai Jiao Tong University etc) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM (@ZTE Corporation) | [pdf] | [smoothquantplus] | ⭐️ |
| 2023.11 | [OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs (@meituan.com) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | 🔥[SparQ] SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE (@graphcore.ai) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.12 | [Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge (@Northeastern University&Oracle) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [CBQ] CBQ: Cross-Block Quantization for Large Language Models (@ustc.edu.cn) | [pdf] | ⚠️ | ⭐️ |
| 2023.10 | [QLLM] QLLM: ACCURATE AND EFFICIENT LOW-BITWIDTH QUANTIZATION FOR LARGE LANGUAGE MODELS (@ZIP Lab&SenseTime Research etc) | [pdf] | ⚠️ | ⭐️ |
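
Most weight-only schemes in this table share one primitive: map FP16/FP32 weights to low-bit integers with a per-channel (or per-group) scale, then dequantize on the fly at matmul time. A minimal symmetric INT8 sketch in NumPy (illustrative only; GPTQ/AWQ/SmoothQuant each add their own calibration on top of this):

```python
import numpy as np

def quantize_int8_per_channel(w):
    # symmetric absmax quantization: one FP scale per output channel (row)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8_per_channel(w)
print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))
```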

📖IO/FLOPs-Aware/Sparse Attention (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax (@NVIDIA) | [pdf] | ⚠️ | ⭐️ |
| 2019.11 | 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need (@Google) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2020.10 | [Hash Attention] REFORMER: THE EFFICIENT TRANSFORMER (@Google) | [pdf] | [reformer] | ⭐️⭐️ |
| 2022.05 | 🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness (@Stanford University etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
| 2022.10 | [Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY (@Google) | [pdf] | ⚠️ | ⭐️ |
| 2023.05 | [FlashAttention] From Online Softmax to FlashAttention (@cs.washington.edu) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.05 | [FLOP, I/O] Dissecting Batching Effects in GPT Inference (@Lequn Chen) | [blog] | ⚠️ | ⭐️ |
| 2023.05 | 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (@Google) | [pdf] | [flaxformer] | ⭐️⭐️ |
| 2023.06 | [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention (@EPFL etc) | [pdf] | [dynamic-sparse-flash-attention] | ⭐️ |
| 2023.07 | 🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning (@Stanford University etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
| 2023.10 | 🔥[Flash-Decoding] Flash-Decoding for long-context inference (@Stanford University etc) | [blog] | [flash-attention] | ⭐️⭐️ |
| 2023.11 | [Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS (@Tsinghua University&Infinigence-AI) | [pdf] | ⚠️ | ⭐️ |
| 2023.01 | [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (@ISTA etc) | [pdf] | [sparsegpt] | ⭐️ |
| 2023.11 | 🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time (@yale&Google) | [pdf] | [hyper-attn] | ⭐️⭐️ |
| 2023.11 | [Streaming Attention Approximation] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space (@Adobe Research etc) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | 🔥[GLA] Gated Linear Attention Transformers with Hardware-Efficient Training (@MIT-IBM Watson AI) | [pdf] | [gated_linear_attention] | ⭐️⭐️ |
| 2023.12 | [SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion (@Beihang University) | [pdf] | ⚠️ | ⭐️ |
| 2023.05 | [Landmark Attention] Random-Access Infinite Context Length for Transformers (@epfl.ch) | [pdf] | [landmark-attention] | ⭐️⭐️ |
| 2023.12 | 🔥[FlashLLM] LLM in a flash: Efficient Large Language Model Inference with Limited Memory (@Apple) | [pdf] | ⚠️ | ⭐️⭐️ |
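
The IO-aware kernels above (FlashAttention, Flash-Decoding) build on the online-softmax recurrence from the 2018 NVIDIA paper: a softmax-weighted sum can be computed in a single pass by rescaling a running accumulator whenever the running maximum changes. A NumPy sketch of that recurrence:

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    # one pass over (score, value) pairs; the full softmax is never materialized
    m, l = -np.inf, 0.0
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        alpha = np.exp(m - m_new)   # rescales old state; 0.0 on the first step
        p = np.exp(s - m_new)
        l = l * alpha + p
        acc = acc * alpha + p * v
        m = m_new
    return acc / l

scores, values = np.array([0.5, 2.0, -1.0]), np.eye(3)
expected = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), expected)
```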

📖KV Cache Scheduling/Quantize/Dropping (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2019.11 | 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need (@Google) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2022.06 | [LTP] Learned Token Pruning for Transformers (@UC Berkeley etc) | [pdf] | [LTP] | ⭐️ |
| 2023.05 | 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (@Google) | [pdf] | [flaxformer] | ⭐️⭐️ |
| 2023.05 | [KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (@) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.06 | [H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (@Rice University etc) | [pdf] | [H2O] | ⭐️ |
| 2023.06 | [QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention (@EPFL etc) | [pdf] | [dynamic-sparse-flash-attention] | ⭐️ |
| 2023.09 | 🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention (@UC Berkeley etc) | [pdf] | [vllm] | ⭐️⭐️ |
| 2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization (@HPC4AI) | [blog] | ⚠️ | ⭐️ |
| 2023.10 | 🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT LLM (@NVIDIA) | [docs] | [TensorRT-LLM] | ⭐️⭐️ |
| 2023.10 | 🔥[Adaptive KV Cache Compress] MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS (@illinois.edu&microsoft) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.10 | [CacheGen] CacheGen: Fast Context Loading for Language Model Applications (@University of Chicago&Microsoft) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO (@Haim Barad etc) | [pdf] | ⚠️ | ⭐️ |
| 2023.11 | [Prompt Cache] PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE (@Yale University etc) | [pdf] | ⚠️ | ⭐️ |
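
Most entries here manage the same object: a per-sequence K/V cache that grows by one token per decode step. PagedAttention's key move is allocating that cache in fixed-size blocks behind an indirection table, so sequences need no contiguous memory. A toy allocator sketch (names and structure are illustrative, not vLLM's internals):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: logical token slots -> (block, offset)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens written so far

    def append(self, seq_id):
        """Reserve the K/V slot for one new token, allocating a block on demand."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # previous block full (or first token)
            # free.pop() raises IndexError when memory runs out; real systems
            # preempt or swap sequences instead of crashing
            self.block_tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        block = self.block_tables[seq_id][n // self.block_size]
        return block, n % self.block_size    # where this token's K/V goes

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool (no fragmentation)."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(20):                          # 20 tokens -> two 16-slot blocks allocated
    block, offset = cache.append(seq_id=0)
cache.release(seq_id=0)
```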

📖Early-Exit/Intermediate Layer Decoding (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2020.04 | [DeeBERT] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference (@uwaterloo.ca) | [pdf] | ⚠️ | ⭐️ |
| 2021.06 | [BERxiT] BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression (@uwaterloo.ca) | [pdf] | [berxit] | ⭐️ |
| 2023.10 | 🔥[LITE] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE (@Arizona State University) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.12 | 🔥🔥[EE-LLM] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism (@alibaba-inc.com) | [pdf] | [EE-LLM] | ⭐️⭐️ |
| 2023.10 | 🔥[FREE] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (@KAIST AI&AWS AI) | [pdf] | [fast_robust_early_exit] | ⭐️⭐️ |
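
The shared mechanism in these papers: attach an LM head (or confidence estimator) to intermediate layers and stop propagating once a prediction is confident enough. A toy NumPy sketch (the single shared `lm_head` and the fixed threshold are simplifying assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_token(per_layer_hidden, lm_head, threshold=0.9):
    """Return (token, exit_layer); stop at the first layer that is confident enough."""
    for layer, h in enumerate(per_layer_hidden):
        probs = softmax(h @ lm_head)           # [vocab]-sized distribution at this depth
        if probs.max() >= threshold:
            return int(probs.argmax()), layer  # early exit: skip the remaining layers
    return int(probs.argmax()), layer          # fell through: full-depth prediction
```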

📖Parallel Decoding/Sampling (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2018.11 | 🔥[Parallel Decoding] Blockwise Parallel Decoding for Deep Autoregressive Models (@Berkeley&Google) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.02 | 🔥[Speculative Sampling] Accelerating Large Language Model Decoding with Speculative Sampling (@DeepMind) | [pdf] | [LLMSpeculativeSampling] | ⭐️⭐️ |
| 2023.05 | 🔥[Speculative Sampling] Fast Inference from Transformers via Speculative Decoding (@Google Research etc) | [pdf] | [LLMSpeculativeSampling] | ⭐️⭐️ |
| 2023.09 | 🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (@Tianle Cai etc) | [blog] | [Medusa] | ⭐️⭐️ |
| 2023.10 | [OSD] Online Speculative Decoding (@UC Berkeley etc) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2023.12 | [Cascade Speculative] Cascade Speculative Drafting for Even Faster LLM Inference (@illinois.edu) | [pdf] | ⚠️ | ⭐️ |
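
Both speculative-sampling papers use the same accept/reject rule: accept draft token x with probability min(1, p(x)/q(x)); on the first rejection, resample from the normalized residual max(p − q, 0). The NumPy sketch below is illustrative and omits the bonus token sampled when the whole draft is accepted:

```python
import numpy as np

def speculative_accept(p_target, q_draft, draft_tokens, rng=None):
    """p_target, q_draft: next-token distributions per draft position, shape [len, vocab]."""
    rng = rng if rng is not None else np.random.default_rng()
    accepted = []
    for i, x in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_target[i, x] / q_draft[i, x]):
            accepted.append(int(x))            # draft token kept
            continue
        residual = np.maximum(p_target[i] - q_draft[i], 0.0)
        residual /= residual.sum()             # resample from the corrected distribution
        accepted.append(int(rng.choice(len(residual), p=residual)))
        break                                  # tokens after a rejection are discarded
    return accepted
```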

📖Structured Prune/KD/Weight Sparse (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2023.12 | [FLAP] Fluctuation-based Adaptive Structured Pruning for Large Language Models (@Chinese Academy of Sciences etc) | [pdf] | [FLAP] | ⭐️⭐️ |
| 2023.12 | 🔥[LASER] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (@mit.edu) | [pdf] | [laser] | ⭐️⭐️ |
| 2023.12 | [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (@SJTU) | [pdf] | [PowerInfer] | ⭐️ |
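
For reference, the simplest structured-pruning baseline these methods refine: score whole output channels and drop the weakest, shrinking the matmul itself instead of just masking weights. The L2-norm score below is a stand-in for, e.g., FLAP's fluctuation metric:

```python
import numpy as np

def prune_output_channels(w, keep_ratio=0.5):
    # w: [out_channels, in_channels]; keep the highest-scoring rows
    norms = np.linalg.norm(w, axis=1)
    k = max(1, int(round(len(norms) * keep_ratio)))
    keep = np.sort(np.argsort(norms)[-k:])  # preserve original channel order
    return w[keep], keep                    # downstream layers must reindex via `keep`

w = np.random.randn(16, 8)
w_small, keep = prune_output_channels(w, keep_ratio=0.25)
print(w_small.shape, keep)                  # (4, 8) plus the surviving channel ids
```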

📖Mixture-of-Experts(MoE) LLM Inference (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2022.11 | 🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production (@NVIDIA&Microsoft) | [pdf] | [FasterTransformer] | ⭐️⭐️ |
| 2023.12 | 🔥[Mixtral Offloading] Fast Inference of Mixture-of-Experts Language Models with Offloading (@Moscow Institute of Physics and Technology etc) | [pdf] | [mixtral-offloading] | ⭐️⭐️ |
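
A MoE layer activates only a few experts per token, which is why offloading the idle experts (as in Mixtral Offloading) pays off. A toy top-k router in NumPy (dense per-token loops for clarity; real kernels group tokens per expert):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    # x: [tokens, d]; gate_w: [d, num_experts]; experts: list of callables d -> d
    logits = x @ gate_w
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]                 # top-k expert ids for this token
        w = np.exp(logits[t, top] - logits[t, top].max())
        w /= w.sum()                                     # softmax over selected experts
        for e, wi in zip(top, w):
            out[t] += wi * experts[e](x[t])              # only k experts ever run
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d))) for _ in range(n_exp)]
y = moe_layer(rng.normal(size=(3, d)), rng.normal(size=(d, n_exp)), experts)
```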

📖CPU/Single GPU/Mobile Inference (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU (@Stanford University etc) | [pdf] | [FlexGen] | ⭐️ |
| 2023.11 | [LLM CPU Inference] Efficient LLM Inference on CPUs (@intel) | [pdf] | [intel-extension-for-transformers] | ⭐️ |
| 2023.12 | [LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices (@University of California Irvine) | [pdf] | ⚠️ | ⭐️ |
| 2023.12 | [OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO (@Haim Barad etc) | [pdf] | ⚠️ | ⭐️ |

📖Non-Transformer Architecture (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2023.05 | 🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era (@Bo Peng etc) | [pdf] | [RWKV-LM] | ⭐️⭐️ |
| 2023.12 | 🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (@cs.cmu.edu etc) | [pdf] | [mamba] | ⭐️⭐️ |
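
What RWKV and Mamba share at inference time is a recurrent form: a fixed-size state updated once per token, so decoding costs O(1) per step with no KV cache. A generic linear state-space recurrence sketch (a drastic simplification; in Mamba the A/B/C parameters are input-dependent):

```python
import numpy as np

def ssm_decode(A, B, C, tokens):
    # A: [d, d], B: [d], C: [d]; scalar input/output per step for simplicity
    h = np.zeros(A.shape[0])
    out = []
    for x in tokens:
        h = A @ h + B * x      # fixed-size state: no cache that grows with context
        out.append(C @ h)
    return np.array(out)

d = 4
A = 0.9 * np.eye(d)            # stable toy dynamics
B, C = np.ones(d), np.ones(d) / d
print(ssm_decode(A, B, C, [1.0, 0.0, 0.0]))  # impulse response decays geometrically
```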

📖GEMM, Tensor Cores, WMMA (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision (@KTH Royal etc) | [pdf] | ⚠️ | ⭐️ |
| 2022.09 | [FP8] FP8 FORMATS FOR DEEP LEARNING (@NVIDIA) | [pdf] | ⚠️ | ⭐️ |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library (@Tokyo Institute etc) | [pdf] | [wmma_extension] | ⭐️ |
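
Tensor Cores consume GEMMs as small fragments (e.g., 16×16×16 in the WMMA API). The NumPy sketch below mimics that tiling in software, purely to show the access pattern (dimensions assumed divisible by the tile size; this is not how WMMA is actually invoked):

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):  # accumulate one tile-sized fragment product
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.randn(64, 64).astype(np.float32)
b = np.random.randn(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```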

📖Position Embed, Others (©️back👆🏻)

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2021.04 | 🔥[RoPE] ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING (@Zhuiyi Technology Co., Ltd.) | [pdf] | [transformers] | ⭐️ |
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs (@ByteDance&NVIDIA) | [pdf] | [ByteTransformer] | ⭐️ |
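
RoPE rotates each (even, odd) channel pair of the query and key by an angle proportional to the token position, so the q·k score depends only on the relative offset. A NumPy sketch of the interleaved-pair formulation from the RoFormer paper (float inputs assumed):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # rotate consecutive (even, odd) channel pairs by position-dependent angles
    d = x.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# the defining property: scores depend only on relative position (5-2 == 13-10)
assert np.isclose(rope(q, 5) @ rope(k, 2), rope(q, 13) @ rope(k, 10))
```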

©️License

GNU General Public License v3.0

🎉Contribute

PRs to this repo are welcome!
