Awesome LLM Compression

A curated list of LLM compression research papers and tools for accelerating LLM training and inference.

📑 Papers
🔧 Tools
🙌 Contributing
🌟 Star History

Papers

Survey

A Survey on Model Compression for Large Language Models
Arxiv 2023 [Paper]
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
Arxiv 2023 [Paper]
Efficient Large Language Models: A Survey
Arxiv 2023 [Paper] [GitHub Page]
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Arxiv 2023 [Paper]
Understanding LLMs: A Comprehensive Overview from Training to Inference
Arxiv 2024 [Paper]
A Survey of Resource-efficient LLM and Multimodal Foundation Models
Arxiv 2024 [Paper]
A Survey on Hardware Accelerators for Large Language Models
Arxiv 2024 [Paper]
A Comprehensive Survey of Compression Algorithms for Language Models
Arxiv 2024 [Paper]
Model Compression and Efficient Inference for Large Language Models: A Survey
Arxiv 2024 [Paper]
A Survey on Knowledge Distillation of Large Language Models
Arxiv 2024 [Paper] [GitHub Page]
Efficient Prompting Methods for Large Language Models: A Survey
Arxiv 2024 [Paper]

Quantization

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
NeurIPS 2022 [Paper] [Code (DeepSpeed)]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
NeurIPS 2022 [Paper] [Code]
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
NeurIPS 2022 [Paper] [Code]
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Arxiv 2022 [Paper]
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023 [Paper] [Code]
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
ICML 2023 [Paper] [Code (DeepSpeed)]
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
ICML 2023 [Paper] [Code]
The case for 4-bit precision: k-bit Inference Scaling Laws
ICML 2023 [Paper]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023 [Paper] [Code]
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
ACL 2023 [Paper]
Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
ACL 2023 [Paper]
QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023 [Paper] [Code]
The Quantization Model of Neural Scaling
NeurIPS 2023 [Paper]
Quantized Distributed Training of Large Models with Convergence Guarantees
Arxiv 2023 [Paper]
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Arxiv 2023 [Paper] [Code]
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
Arxiv 2023 [Paper] [Code]
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
Arxiv 2023 [Paper]
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
NeurIPS 2023 [Paper]
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
Arxiv 2023 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Arxiv 2023 [Paper] [Code]
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Arxiv 2023 [Paper] [Code]
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Arxiv 2023 [Paper] [Code]
OWQ: Lessons learned from activation outliers for weight quantization in large language models
Arxiv 2023 [Paper]
SqueezeLLM: Dense-and-Sparse Quantization
Arxiv 2023 [Paper] [Code]
INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
Arxiv 2023 [Paper]
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
Arxiv 2023 [Paper]
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
Arxiv 2023 [Paper] [Code]
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Arxiv 2023 [Paper] [Code]
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Arxiv 2023 [Paper]
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Arxiv 2023 [Paper] [Code (DeepSpeed)]
OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
ISCA 2023 [Paper]
NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search
Arxiv 2023 [Paper]
GPT-Zip: Deep Compression of Finetuned Large Language Models
ICML 2023 Workshop ES-FoMO [Paper]
Generating Efficient Kernels for Quantized Inference on Large Language Models
ICML 2023 Workshop ES-FoMO [Paper]
Gradient-Based Post-Training Quantization: Challenging the Status Quo
Arxiv 2023 [Paper]
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Arxiv 2023 [Paper]
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
Arxiv 2023 [Paper]
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
Arxiv 2023 [Paper]
QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm
Arxiv 2023 [Paper]
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Arxiv 2023 [Paper]
Understanding the Impact of Post-Training Quantization on Large-scale Language Models
Arxiv 2023 [Paper]
MEMORY-VQ: Compression for Tractable Internet-Scale Memory
Arxiv 2023 [Paper]
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Arxiv 2023 [Paper] [Code]
Efficient Post-training Quantization with FP8 Formats
Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Arxiv 2023 [Paper] [Code]
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
Arxiv 2023 [Paper]
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
Arxiv 2023 [Paper]
PB-LLM: Partially Binarized Large Language Models
Arxiv 2023 [Paper] [Code]
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
Arxiv 2023 [Paper]
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Arxiv 2023 [Paper]
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Arxiv 2023 [Paper]
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
Arxiv 2023 [Paper]
TEQ: Trainable Equivalent Transformation for Quantization of LLMs
Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]
BitNet: Scaling 1-bit Transformers for Large Language Models
Arxiv 2023 [Paper] [Code]
FP8-LM: Training FP8 Large Language Models
Arxiv 2023 [Paper] [Code]
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Arxiv 2023 [Paper] [Code]
AFPQ: Asymmetric Floating Point Quantization for LLMs
Arxiv 2023 [Paper] [Code]
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
Arxiv 2023 [Paper]
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Arxiv 2023 [Paper]
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Arxiv 2023 [Paper]
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Arxiv 2023 [Paper]
How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
Arxiv 2023 [Paper]
A Speed Odyssey for Deployable Quantization of LLMs
Arxiv 2023 [Paper]
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
Arxiv 2023 [Paper]
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
NeurIPS 2023 [Paper] [Code]
Efficient LLM Inference on CPUs
NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing [Paper] [Code]
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
EMNLP 2023 Findings [Paper]
Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models
EMNLP 2023 [Paper]
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
EMNLP 2023 [Paper] [Code]
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
EMNLP 2023 [Paper]
Watermarking LLMs with Weight Quantization
EMNLP 2023 [Paper] [Code]
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
EMNLP 2023 [Paper]
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
EMNLP 2023 [Paper] [Code]
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
AAAI 2024 [Paper]
SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM
Arxiv 2023 [Paper]
CBQ: Cross-Block Quantization for Large Language Models
Arxiv 2023 [Paper]
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Arxiv 2023 [Paper]
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
NeurIPS 2023 [Paper] [Code]
A Performance Evaluation of a Quantized Large Language Model on Various Smartphones
Arxiv 2023 [Paper]
DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
Arxiv 2023 [Paper] [Code]
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
Arxiv 2024 [Paper]
Extreme Compression of Large Language Models via Additive Quantization
Arxiv 2024 [Paper]
Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
Arxiv 2024 [Paper]
Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
Arxiv 2024 [Paper]
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Arxiv 2024 [Paper]
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Arxiv 2024 [Paper]
Can Large Language Models Understand Context?
Arxiv 2024 [Paper]
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
Arxiv 2024 [Paper] [Code]
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Arxiv 2024 [Paper]
LQER: Low-Rank Quantization Error Reconstruction for LLMs
Arxiv 2024 [Paper]
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Arxiv 2024 [Paper] [Code]
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Arxiv 2024 [Paper] [Code]
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Arxiv 2024 [Paper] [Code]
L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
Arxiv 2024 [Paper]
TP-Aware Dequantization
Arxiv 2024 [Paper]
ApiQ: Finetuning of 2-Bit Quantized Large Language Model
Arxiv 2024 [Paper]
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
Arxiv 2024 [Paper] [Code]
BitDelta: Your Fine-Tune May Only Be Worth One Bit
Arxiv 2024 [Paper] [Code]
QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning
AAAI EIW Workshop 2024 [Paper]
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
Arxiv 2024 [Paper] [Code]
OneBit: Towards Extremely Low-bit Large Language Models
Arxiv 2024 [Paper]
DB-LLM: Accurate Dual-Binarization for Efficient LLMs
Arxiv 2024 [Paper]
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More
Arxiv 2024 [Paper]
GPTVQ: The Blessing of Dimensionality for LLM Quantization
Arxiv 2024 [Paper] [Code]
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
DAC 2024 [Paper]
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
DAC 2024 [Paper]
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
Arxiv 2024 [Paper]
Evaluating Quantized Large Language Models
Arxiv 2024 [Paper]
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Arxiv 2024 [Paper]
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
Arxiv 2024 [Paper]
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
Arxiv 2024 [Paper]
On the Compressibility of Quantized Large Language Models
Arxiv 2024 [Paper]
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Arxiv 2024 [Paper]
QAQ: Quality Adaptive Quantization for LLM KV Cache
Arxiv 2024 [Paper] [Code]
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Arxiv 2024 [Paper]
What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
Arxiv 2024 [Paper]
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression
Arxiv 2024 [Paper] [Code]
AffineQuant: Affine Transformation Quantization for Large Language Models
ICLR 2024 [Paper] [Code]
Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
ICLR Practical ML for Low Resource Settings Workshop 2024 [Paper]
Accurate Block Quantization in LLMs with Outliers
Arxiv 2024 [Paper]
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Arxiv 2024 [Paper] [Code]
Minimize Quantization Output Error with Bias Compensation
Arxiv 2024 [Paper] [Code]
Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
Arxiv 2024 [Paper]
Increased LLM Vulnerabilities from Fine-tuning and Quantization
Arxiv 2024 [Paper]
Quantization of Large Language Models with an Overdetermined Basis
Arxiv 2024 [Paper]
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Arxiv 2024 [Paper] [Code] [Model]
How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training
Arxiv 2024 [Paper]
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
Arxiv 2024 [Paper] [Code]
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
Arxiv 2024 [Paper]
When Quantization Affects Confidence of Large Language Models?
NAACL 2024 [Paper]
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Arxiv 2024 [Paper] [Code]
Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs
ICML 2024 [Paper]
LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models
Arxiv 2024 [Paper] [Code]
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
Arxiv 2024 [Paper]
Combining multiple post-training techniques to achieve most efficient quantized LLMs
Arxiv 2024 [Paper]
Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization
Arxiv 2024 [Paper]

Pruning and Sparsity

The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
ICLR 2023 [Paper]
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
ICML 2023 [Paper] [Code]
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
ICML 2023 [Paper] [Code]
LLM-Pruner: On the Structural Pruning of Large Language Models
NeurIPS 2023 [Paper] [Code]
ZipLM: Inference-Aware Structured Pruning of Language Models
NeurIPS 2023 [Paper] [Code]
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
NeurIPS 2023 [Paper] [Code]
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
NeurIPS 2023 [Paper]
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
NeurIPS 2023 [Paper] [Code]
Learning to Compress Prompts with Gist Tokens
NeurIPS 2023 [Paper]
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
NeurIPS 2023 [Paper]
Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
ICLR 2023 TinyPapers [Paper]
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Arxiv 2023 [Paper] [Code]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Arxiv 2023 [Paper]
Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
Arxiv 2023 [Paper] [Code]
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
ACL 2023 [Paper] [Code]
Structured Pruning for Efficient Generative Pre-trained Language Models
ACL 2023 [Paper]
A Simple and Effective Pruning Approach for Large Language Models
Arxiv 2023 [Paper] [Code]
Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Arxiv 2023 [Paper]
Structural pruning of large language models via neural architecture search
AutoML 2023 [Paper]
Pruning Large Language Models via Accuracy Predictor
ICASSP 2024 [Paper]
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
VLDB 2024 [Paper] [Code]
Compressing LLMs: The Truth is Rarely Pure and Never Simple
Arxiv 2023 [Paper]
Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity
Arxiv 2023 [Paper] [Code]
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Arxiv 2023 [Paper]
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Arxiv 2023 [Paper] [Code]
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Arxiv 2023 [Paper] [Code]
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Arxiv 2023 [Paper] [Code]
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Arxiv 2023 [Paper] [Code]
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
ICASSP 2024 [Paper]
Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
EMNLP 2023 Findings [Paper]
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
EMNLP 2023 Findings [Paper]
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Arxiv 2023 [Paper]
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
Arxiv 2023 [Paper]
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Arxiv 2023 [Paper]
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Arxiv 2023 [Paper]
Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
Arxiv 2023 [Paper] [Code]
How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
Arxiv 2023 [Paper]
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
OpenReview [Paper] [Code]
Pushing Gradient towards Zero: A Novel Pruning Method for Large Language Models
OpenReview 2023 [Paper]
An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models
Preprints 2023 [Paper]
Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
Arxiv 2023 [Paper] [Code]
LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Arxiv 2023 [Paper]
Mini-GPTs: Efficient Large Language Models through Contextual Pruning
Arxiv 2023 [Paper] [Code]
The LLM Surgeon
Arxiv 2023 [Paper]
Fluctuation-based Adaptive Structured Pruning for Large Language Models
AAAI 2024 [Paper]
How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
CPAL 2024 [Paper]
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs
Arxiv 2023 [Paper]
Fast and Optimal Weight Update for Pruned Large Language Models
Arxiv 2024 [Paper]
APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
Arxiv 2024 [Paper]
Scaling Sparse Fine-Tuning to Large Language Models
Arxiv 2024 [Paper]
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
ICLR 2024 [Paper] [Code]
Shortened LLaMA: A Simple Depth Pruning for Large Language Models
Arxiv 2024 [Paper]
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Arxiv 2024 [Paper] [Code]
NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
Arxiv 2024 [Paper]
LaCo: Large Language Model Pruning via Layer Collapse
Arxiv 2024 [Paper]
Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers
Arxiv 2024 [Paper]
EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
Arxiv 2024 [Paper] [Code]
Data-free Weight Compress and Denoise for Large Language Models
Arxiv 2024 [Paper]
Gradient-Free Adaptive Global Pruning for Pre-trained Language Models
Arxiv 2024 [Paper]
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Arxiv 2024 [Paper]
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Arxiv 2024 [Paper] [Code]
Compressing Large Language Models by Streamlining the Unimportant Layer
Arxiv 2024 [Paper]
LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
Arxiv 2024 [Paper]
Shears: Unstructured Sparsity with Neural Low-rank Adapter Search
NAACL 2024 [Paper]
Eigenpruning
NAACL 2024 Abstract [Paper]
OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning
Arxiv 2024 [Paper]
Pruning as a Domain-specific LLM Extractor
NAACL 2024 Findings [Paper] [Code]
Differentiable Model Scaling using Differentiable Topk
ICML 2024 [Paper]
COPAL: Continual Pruning in Large Language Generative Models
ICML 2024 [Paper]
Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization
ACL Findings 2024 [Paper]

Distillation

Lifting the Curse of Capacity Gap in Distilling Language Models
ACL 2023 [Paper] [Code]
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
ACL 2023 [Paper]
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
ACL 2023 [Paper]
SCOTT: Self-Consistent Chain-of-Thought Distillation
ACL 2023 [Paper]
DISCO: Distilling Counterfactuals with Large Language Models
ACL 2023 [Paper] [Code]
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Arxiv 2023 [Paper] [Code]
How To Train Your (Compressed) Large Language Model
Arxiv 2023 [Paper]
The False Promise of Imitating Proprietary LLMs
Arxiv 2023 [Paper]
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
Arxiv 2023 [Paper] [Code]
PaD: Program-aided Distillation Specializes Large Models in Reasoning
Arxiv 2023 [Paper]
MiniLLM: Knowledge Distillation of Large Language Models
ICLR 2024 [Paper] [Code]
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
ICLR 2024 [Paper]
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
Arxiv 2023 [Paper]
Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction
Arxiv 2023 [Paper]
Task-agnostic Distillation of Encoder-Decoder Language Models
Arxiv 2023 [Paper]
Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
Arxiv 2023 [Paper]
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
CoNLL 2023 [Paper] [Code]
Can a student Large Language Model perform as well as it's teacher?
Arxiv 2023 [Paper]
Multistage Collaborative Knowledge Distillation from Large Language Models
Arxiv 2023 [Paper]
Lion: Adversarial Distillation of Closed-Source Large Language Model
EMNLP 2023 [Paper] [Code]
MCC-KD: Multi-CoT Consistent Knowledge Distillation
EMNLP 2023 [Paper]
PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation
EMNLP 2023 [Paper]
YODA: Teacher-Student Progressive Learning for Language Models
Arxiv 2023 [Paper]
Knowledge Fusion of Large Language Models
ICLR 2024 [Paper] [Code]
Knowledge Distillation for Closed-Source Language Models
Arxiv 2024 [Paper]
TinyLLM: Learning a Small Student from Multiple Large Language Models
Arxiv 2024 [Paper]
Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Arxiv 2024 [Paper]
Revisiting Knowledge Distillation for Autoregressive Language Models
Arxiv 2024 [Paper]
Sinkhorn Distance Minimization for Knowledge Distillation
COLING 2024 [Paper]
Divide-or-Conquer? Which Part Should You Distill Your LLM?
Arxiv 2024 [Paper]
Learning to Maximize Mutual Information for Chain-of-Thought Distillation
Arxiv 2024 [Paper]
DistiLLM: Towards Streamlined Distillation for Large Language Models
Arxiv 2024 [Paper] [Code]
Efficiently Distilling LLMs for Edge Applications
NAACL 2024 [Paper]
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
Arxiv 2024 [Paper]
Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs
Arxiv 2024 [Paper]

Efficient Prompting

Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
ACL 2023 [Paper] [Code]
Batch Prompting: Efficient Inference with Large Language Model APIs
EMNLP 2023 [Paper] [Code]
Adapting Language Models to Compress Contexts
EMNLP 2023 [Paper] [Code]
Compressing Context to Enhance Inference Efficiency of Large Language Models
EMNLP 2023 [Paper] [Code]
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
EMNLP 2023 [Paper] [Code]
Vector-Quantized Prompt Learning for Paraphrase Generation
EMNLP 2023 Findings [Paper]
Efficient Prompting via Dynamic In-Context Learning
Arxiv 2023 [Paper]
Learning to Compress Prompts with Gist Tokens
Arxiv 2023 [Paper] [Code]
In-context Autoencoder for Context Compression in a Large Language Model
Arxiv 2023 [Paper]
Discrete Prompt Compression with Reinforcement Learning
Arxiv 2023 [Paper]
BatchPrompt: Accomplish more with less
Arxiv 2023 [Paper]
(Dynamic) Prompting might be all you need to repair Compressed LLMs
Arxiv 2023 [Paper]
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
Arxiv 2023 [Paper] [Code]
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
Arxiv 2023 [Paper] [Code]
Extending Context Window of Large Language Models via Semantic Compression
Arxiv 2023 [Paper]
Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
Arxiv 2023 [Paper]
The Impact of Reasoning Step Length on Large Language Models
Arxiv 2024 [Paper]
Compressed Context Memory For Online Language Model Interaction
ICLR 2024 [Paper] [Code]
Learning to Compress Prompt in Natural Language Formats
Arxiv 2024 [Paper]
Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression
Arxiv 2024 [Paper] [Code]
StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses
Arxiv 2024 [Paper]
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Arxiv 2024 [Paper] [Code]
PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models
Arxiv 2024 [Paper] [Code]
PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression
Arxiv 2024 [Paper]
Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
Arxiv 2024 [Paper] [Code]
Adapting LLMs for Efficient Context Processing through Soft Prompt Compression
Arxiv 2024 [Paper]
Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation
Arxiv 2024 [Paper]

Other

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Arxiv 2022 [Paper]
TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
Arxiv 2023 [Paper]
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Arxiv 2023 [Paper]
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
Arxiv 2023 [Paper]
Scaling In-Context Demonstrations with Structured Attention
Arxiv 2023 [Paper]
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Arxiv 2023 [Paper] [Code]
CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
Arxiv 2023 [Paper]
Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping
Arxiv 2023 [Paper]
LLMCad: Fast and Scalable On-device Large Language Model Inference
Arxiv 2023 [Paper]
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
Arxiv 2023 [Paper]
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Arxiv 2023 [Paper] [Code]
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Arxiv 2023 [Paper] [Code]
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Arxiv 2023 [Paper]
Efficient Streaming Language Models with Attention Sinks
Arxiv 2023 [Paper] [Code]
Efficient Large Language Models Fine-Tuning On Graphs
Arxiv 2023 [Paper]
SparQ Attention: Bandwidth-Efficient LLM Inference
Arxiv 2023 [Paper]
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
Arxiv 2023 [Paper]
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Arxiv 2023 [Paper] [Code]
Dataset Quantization
ICCV 2023 [Paper] [Code]
Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
NeurIPS 2023 [Paper] [Code]
Context Compression for Auto-regressive Transformers with Sentinel Tokens
EMNLP 2023 [Paper] [Code]
TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
EMNLP 2023 Findings [Paper]
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression
EMNLP 2023 Findings [Paper]
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
Arxiv 2024 [Paper]
LoMA: Lossless Compressed Memory Attention
Arxiv 2024 [Paper]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Arxiv 2024 [Paper] [Code]
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Arxiv 2024 [Paper] [Code]
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
Arxiv 2024 [Paper]
BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models
Arxiv 2024 [Paper] [Code]
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Arxiv 2024 [Paper]
Not all Layers of LLMs are Necessary during Inference
Arxiv 2024 [Paper]
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Arxiv 2024 [Paper]
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Arxiv 2024 [Paper]
Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
HPCA 2024 [Paper]
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
MLSys 2024 [Paper]
ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models
Arxiv 2024 [Paper]
Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation
Arxiv 2024 [Paper]
Training LLMs over Neurally Compressed Text
Arxiv 2024 [Paper]
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Arxiv 2024 [Paper] [Code]
SnapKV: LLM Knows What You are Looking for Before Generation
Arxiv 2024 [Paper] [Code]
Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models
Arxiv 2024 [Paper]
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
ICML 2024 [Paper]
Token-wise Influential Training Data Retrieval for Large Language Models
ACL 2024 [Paper] [Code]

Tools

BMCook: Model Compression for Big Models [Code]
llama.cpp: Inference of LLaMA model in pure C/C++ [Code]
LangChain: Building applications with LLMs through composability [Code]
GPTQ-for-LLaMA: 4 bits quantization of LLaMA using GPTQ [Code]
Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]
vllm: A high-throughput and memory-efficient inference and serving engine for LLMs [Code]
LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA) [Code]
gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python. [Code]
Efficient-Tuning-LLMs: (Efficient Finetuning of QLoRA LLMs). QLoRA, LLama, bloom, baichuan-7B, GLM [Code]
bitsandbytes: 8-bit CUDA functions for PyTorch [Code]
ExLlama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [Code]
lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]
Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]
llama.onnx: LLaMa/RWKV onnx models, quantization and testcase [Code]
fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend. [Code]
Sparsebit: A model compression and acceleration toolbox based on pytorch. [Code]
llama2.c: Inference Llama 2 in one file of pure C [Code]
Megatron-LM: Ongoing research training transformer models at scale [Code]
ggml: Tensor library for machine learning [Code]
LLamaSharp: C#/.NET binding of llama.cpp, including LLaMa/GPT model inference and quantization, ASP.NET core integration and UI [Code]
rwkv.cpp: INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model [Code]
Can my GPU run this LLM?: Calculate GPU memory requirement & breakdown for training/inference of LLM models. Supports ggml/bnb quantization [Code]
TinyChatEngine: On-Device LLM Inference Library [Code]
TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. [Code]
IntLLaMA: A fast and light quantization solution for LLaMA [Code]
EasyLLM: Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. While enhancing usability, it also ensures training efficiency [Code]
GreenBit LLaMA: Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs [Code]
Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet) [Code]
LLM-Viewer: Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface. [Code]
LLaMA3-Quantization: A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. [Code]
LLamaSharp: A C#/.NET library to run LLM models (🦙LLaMA/LLaVA) on your local device efficiently. [Code]
Green-bit-LLM: A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. [Code] [Model]
Bitorch Engine: Streamlining AI with Open-Source Low-Bit Quantization. [Code]
LLaMA-Factory: Unify Efficient Fine-Tuning of 100+ LLMs [Code]

Contributing

This is an active repository and your contributions are always welcome! Before you add papers/tools into the awesome list, please make sure that:

The paper or tool is related to Large Language Models (LLMs). If the compression algorithms or tools are only evaluated on small-scale language models (e.g., BERT), they should not be included in the list.
The paper should be inserted in the correct position in chronological order (publication/arxiv release time).
If the paper is posted on arxiv, the [Paper] link should point to the arxiv abstract page, not the pdf.

HuangOwen / Awesome-LLM-Compression