Mihaiii / sparsegpt-for-LLaMA

Code for the paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot" with LLaMA implementation.

Paper: https://arxiv.org/abs/2301.00774

SparseGPT for LLaMA

This repository contains code to reproduce the key results of the paper SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, adapted to LLaMA.

Specifically, it provides scripts and implementations to:

  • Evaluate baseline and pruned models on raw WikiText2, PTB, and a subset of C4 (datautils.py, opt.py, bloom.py, llama.py).
  • Perform unstructured, n:m, and sparse + quantized SparseGPT compression on OPT, BLOOM, and LLaMA models (sparsegpt.py, opt.py, bloom.py, llama.py).

We note that this SparseGPT implementation is based on IST-DASLab's open-source GPTQ code.
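For intuition about the two sparsity patterns listed above, the sketch below builds an unstructured 50% mask and a 2:4 (n:m) mask for a single weight matrix. It uses plain weight magnitude as the selection criterion purely for illustration, and the helper names are made up for this example; SparseGPT itself selects and compensates weights using second-order (Hessian-based) information, which is what sparsegpt.py implements.

# Illustration only: magnitude-based masks for the two sparsity patterns.
# SparseGPT's real criterion is Hessian-informed and also updates the
# remaining weights to compensate for the pruned ones.
import torch

def unstructured_mask(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Keep the largest-magnitude (1 - sparsity) fraction of all entries.
    k = int(W.numel() * sparsity)
    threshold = torch.kthvalue(W.abs().flatten(), k).values
    return W.abs() > threshold

def n_m_mask(W: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    # Keep the n largest-magnitude entries in every group of m along each row.
    rows, cols = W.shape
    groups = W.abs().reshape(rows, cols // m, m)
    keep = torch.zeros_like(groups, dtype=torch.bool)
    idx = groups.topk(n, dim=-1).indices
    keep.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    return keep.reshape(rows, cols)

W = torch.randn(8, 16)
print(unstructured_mask(W, 0.5).float().mean())  # ~0.5 density overall
print(n_m_mask(W, 2, 4).float().mean())          # exactly 0.5, enforced per group of 4

The 2:4 pattern is what --prunen 2 --prunem 4 requests and matches the layout that NVIDIA Ampere GPUs can accelerate.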

Perplexity Results (Lower is better)

| Model     | Bits | Sparsity ratio | RAM (GiB) | VRAM (GiB) | WikiText2 | PTB      | C4      |
|-----------|------|----------------|-----------|------------|-----------|----------|---------|
| LLaMA-7B  | 16   | 50% uniform    | 15        | 8.5        | 7.21254   | 10.96087 | 8.5896  |
| LLaMA-13B | 16   | 50% uniform    | 27        | 12         | 6.20875   | 9.33356  | 7.6749  |
| LLaMA-33B | 16   | 50% uniform    | 63        | 16         | 5.3358    | 8.1773   | 6.922   |
| LLaMA-65B | 16   | 50% uniform    | 127       | 25.5       | 4.60178   | 7.52578  | 6.32754 |

LLaMA 65B evaluation results provided by seggybop.
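Each perplexity above is exp of the average per-token negative log-likelihood over the held-out corpus, evaluated in fixed-length segments (2048 tokens for LLaMA). Below is a minimal sketch of that computation, assuming a Hugging Face causal LM and an already-tokenized corpus; the repo's actual loaders and evaluation loop live in datautils.py and llama.py.

# Minimal perplexity sketch; assumes a Hugging Face causal LM whose forward
# pass returns the mean cross-entropy when labels are supplied.
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, seqlen: int = 2048) -> float:
    # input_ids: one long 1D tensor of token ids for the evaluation corpus.
    model.eval()
    nlls = []
    n_segments = input_ids.numel() // seqlen
    for i in range(n_segments):
        segment = input_ids[i * seqlen:(i + 1) * seqlen].unsqueeze(0).to(model.device)
        loss = model(segment, labels=segment).loss  # mean NLL over the segment
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen)).item()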

Dependencies

  • torch: tested on v1.10.1+cu111
  • transformers: tested on v4.21.2
  • datasets: tested on v1.17.0
  • wandb
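A minimal environment setup consistent with the tested versions above (install a CUDA build of torch first, following the official instructions for your CUDA version; wandb is only needed for logging):

pip install transformers==4.21.2 datasets==1.17.0 wandb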

Usage

Here are some sample commands to run baselines and sparsification on LLaMA models, followed by perplexity evaluations on raw WikiText2, PTB, and C4. See also the command-line argument documentation in the scripts.

# Run dense baseline
python llama.py decapoda-research/llama-7b-hf c4

# Run magnitude baseline
python llama.py decapoda-research/llama-7b-hf c4 --sparsity .5 --gmp

# Prune to 50% uniform sparsity with SparseGPT
python llama.py decapoda-research/llama-7b-hf c4 --sparsity .5

# Prune to full 2:4 sparsity with SparseGPT and save the model
python llama-test.py decapoda-research/llama-7b-hf --prunen 2 --prunem 4 --save /path/to/model.pt

# Prune to 50% sparsity + 4-bit quantization with SparseGPT -- currently not working
python llama.py decapoda-research/llama-7b-hf --sparsity .5 --wbits 4

To run on other LLaMA models, replace "decapoda-research/llama-7b-hf" with the Hugging Face name of the corresponding model.
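To reuse a checkpoint written with --save, the sketch below assumes the file is a plain state dict written with torch.save and that your installed transformers version supports LLaMA; check llama.py for the exact format it writes before relying on this.

# Hedged sketch: reload a pruned checkpoint produced by --save, assuming it is
# a plain state dict written with torch.save (verify against llama.py).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)
state_dict = torch.load("/path/to/model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model = model.to("cuda").eval()

Note that the pruned weights are simply zeros stored in dense tensors, so the checkpoint is the same size as the dense model; realizing memory or speed gains requires a sparse-aware storage format or runtime.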

Cite

If you found this work useful, please consider citing:

@article{frantar-sparsegpt,
  title={{SparseGPT}: Massive Language Models Can Be Accurately Pruned in One-Shot}, 
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2301.00774}
}

License

Apache License 2.0