cccpr's starred repositories
flash-attention
Fast and memory-efficient exact attention
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
lm-evaluation-harness
A framework for few-shot evaluation of language models.
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) on over 100 datasets.
GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
llama-chat
Chat with Meta's LLaMA models at home, made easy
Outlier_Suppression_Plus
Official implementation of the EMNLP 2023 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
llm-mixed-q
Mixed-precision quantization for LLMs