There are 40 repositories under the quantization topic.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
Faster Whisper transcription with CTranslate2
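As a quick illustration of how quantization shows up in this library's API, here is a minimal sketch of CPU transcription with int8 weights via the CTranslate2 backend; the model size and the audio path "audio.mp3" are placeholders.

```python
from faster_whisper import WhisperModel

# Load a Whisper model with int8-quantized weights for CPU inference.
model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus metadata.
segments, info = model.transcribe("audio.mp3")
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```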
[🔥 updating ...] AI-powered automated quantitative trading bot (fully local deployment). AI-powered Quantitative Investment Research Platform. 📃 Online docs: https://ufund-me.github.io/Qbot ✨ News: qbot-mini: https://github.com/Charmve/iQuant
Accessible large language models via k-bit quantization for PyTorch.
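A common way to use bitsandbytes is through its 🤗 Transformers integration; below is a minimal sketch of loading a model in 4-bit NF4 (the QLoRA configuration). The model id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "meta-llama/Llama-2-7b-hf" is a placeholder model id.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```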
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
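A minimal sketch of GPTQ quantization with this package, following its documented pattern; the model id, calibration sentence, and output directory are placeholders.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with a group size of 128 columns per quantization group.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ needs a small set of tokenized calibration examples.
examples = [
    tokenizer("auto-gptq is an easy-to-use quantization package.",
              return_tensors="pt")
]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")  # placeholder output dir
```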
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Fast inference engine for Transformer models
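For illustration, a minimal sketch of int8 inference with CTranslate2; "ende_ctranslate2" stands in for a model directory produced by one of the library's converters, and the SentencePiece-style tokens are illustrative since inputs must be pre-tokenized.

```python
import ctranslate2

# Load a converted model directory; int8 compute quantizes on the fly.
translator = ctranslate2.Translator(
    "ende_ctranslate2", device="cpu", compute_type="int8"
)

# translate_batch() takes pre-tokenized inputs and returns ranked hypotheses.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])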
Sparsity-aware deep learning inference runtime for CPUs
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
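A minimal sketch of the ONNX Runtime path in Optimum: export a Transformers checkpoint to ONNX on the fly and run it through a standard pipeline. The checkpoint name is just an example.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
# export=True converts the PyTorch checkpoint to ONNX at load time.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Quantized inference can be fast."))
```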
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
Build, customize, and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
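A hedged sketch of static int8 post-training quantization using what I understand to be the 2.x-style API of Intel Neural Compressor; `model` and `calib_dataloader` are assumed to already exist.

```python
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Static int8 PTQ: calibrate activation ranges on a small dataloader.
conf = PostTrainingQuantConfig(approach="static")
q_model = fit(model=model, conf=conf, calib_dataloader=calib_dataloader)
q_model.save("./quantized_model")  # placeholder output path
```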
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
Run Mixtral-8x7B models in Colab or on consumer desktops
micronet, a model compression and deployment library. Compression: (1) quantization: quantization-aware training (QAT) with high-bit (>2b) methods (DoReFa; Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference) and low-bit (≤2b) ternary/binary methods (TWN/BNN/XNOR-Net), plus post-training quantization (PTQ), 8-bit (TensorRT); (2) pruning: normal, regular, and group-convolution channel pruning; (3) group convolution structure; (4) batch-normalization fusion for quantization. Deployment: TensorRT, fp32/fp16/int8 (PTQ calibration), op adaptation (upsample), dynamic shape.
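Since QAT recurs throughout this list, a library-agnostic sketch may help: minimal PyTorch fake quantization with a straight-through estimator (STE), the core trick behind quantization-aware training. This illustrates the general technique, not micronet's actual API.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, bits=8):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        # Round to the integer grid, then map back to float ("fake" quant).
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend rounding is the identity so gradients flow through.
        return grad_output, None

class QuantLinear(torch.nn.Linear):
    """Linear layer that trains against quantized weights."""

    def forward(self, x):
        w_q = FakeQuantSTE.apply(self.weight, 8)
        return torch.nn.functional.linear(x, w_q, self.bias)

# Usage: drop-in replacement during training.
layer = QuantLinear(16, 8)
out = layer(torch.randn(4, 16))
```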
A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research and is continuously improving. PRs adding works (papers, repositories) that the repo has missed are welcome.
A Python package that extends official PyTorch to easily achieve better performance on Intel platforms.
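A minimal sketch of the typical inference flow with Intel Extension for PyTorch; `model` and `example_input` are assumed to exist.

```python
import torch
import intel_extension_for_pytorch as ipex

# `model` is an assumed torch.nn.Module, put in eval mode for inference.
model = model.eval()
# ipex.optimize applies Intel-specific operator and memory-layout optimizations.
optimized = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = optimized(example_input)  # `example_input` is a placeholder
```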
OpenMMLab Model Compression Toolbox and Benchmark.
PaddleSlim is an open-source library for deep model compression and architecture search.
A toolkit for Keras and TensorFlow to optimize ML models for deployment, including quantization and pruning.
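A minimal sketch of quantization-aware training with this toolkit; `model` is an assumed Keras model built from supported layers.

```python
import tensorflow_model_optimization as tfmot

# quantize_model rewrites the Keras graph with fake-quant nodes for QAT.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Fine-tune on training data, then convert with the TFLite converter
# to obtain an actual int8 model.
```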
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
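As a rough illustration of the weight-memory part of such a calculation (ignoring KV cache, activations, and runtime overhead): a generic rule of thumb, not the tool's exact formula.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Back-of-envelope VRAM for weights only: params * bits / 8 bytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1024**3

# A 7B model: ~13 GB at fp16, ~3.3 GB at 4-bit quantization.
print(weight_memory_gb(7, 16))  # ≈ 13.0
print(weight_memory_gb(7, 4))   # ≈ 3.3
```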
Efficient computing methods developed by Huawei Noah's Ark Lab
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
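A toy sketch of the core idea, assuming nothing about the actual implementation: split a weight matrix into a high-precision low-rank component (which absorbs outliers) plus a 4-bit quantized residual, so that W ≈ L + Q(W − L).

```python
import torch

def svd_plus_quant(W: torch.Tensor, rank: int = 32, bits: int = 4):
    """Illustrative decomposition in the spirit of SVDQuant; not the
    paper's algorithm (which also migrates outliers before the SVD)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Low-rank part kept in high precision.
    L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    # Quantize the residual to `bits` with a single symmetric scale.
    R = W - L
    qmax = 2 ** (bits - 1) - 1
    scale = R.abs().max().clamp(min=1e-8) / qmax
    R_q = torch.clamp(torch.round(R / scale), -qmax - 1, qmax) * scale
    return L, R_q  # W ≈ L + R_q

L, R_q = svd_plus_quant(torch.randn(256, 256))
```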
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
Neural Network Compression Framework for enhanced OpenVINO™ inference
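A minimal sketch of NNCF's post-training quantization flow; `model` and `val_loader` are assumed to exist, and `transform_fn` is a hypothetical adapter from dataloader items to model inputs.

```python
import nncf

def transform_fn(data_item):
    # Hypothetical adapter: dataloader yields (images, labels) pairs.
    images, _ = data_item
    return images

# Calibrate activation ranges on a small dataset, then quantize.
calibration_dataset = nncf.Dataset(val_loader, transform_fn)
quantized_model = nncf.quantize(model, calibration_dataset)
```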
Mastering Applied AI, One Concept at a Time