ZZK's repositories
BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
EETQ
Easy and Efficient Quantization for Transformers
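Weight-only int8 quantization of the kind such libraries target can be illustrated with a tiny NumPy sketch. This uses per-row symmetric scales; the function names are illustrative, not EETQ's actual API, which runs fused CUDA kernels rather than NumPy.

```python
import numpy as np

def quantize_int8(w):
    """Per-row symmetric int8 quantization of a 2-D weight matrix.
    Returns the int8 codes and the per-row float scales."""
    # One scale per output row; epsilon guards all-zero rows.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale
```

The round-trip error per element is bounded by half a quantization step (scale / 2), which is why weight-only int8 usually preserves accuracy for inference.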
fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
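The transform this repo accelerates has a classic O(n log n) butterfly algorithm, the fast Walsh-Hadamard transform. Below is a minimal pure-Python sketch of the unnormalized transform for intuition; it is independent of the repo's CUDA kernels and PyTorch interface.

```python
def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^k sequence.
    Equivalent to multiplying by the Hadamard matrix H_n, in O(n log n)."""
    a = list(a)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly: combine pairs at distance h.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

Applying the transform twice scales the input by n, since H_n H_n = n I.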
faster-nougat
An implementation of Nougat focused on processing PDFs locally.

flux
A fast communication-overlapping library for tensor parallelism on GPUs.
kvikio
KvikIO - High Performance File IO
lightning-thunder
Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch that enables using different hardware executors at once, across one or thousands of GPUs.
LLM101n
LLM101n: Let's build a Storyteller
MInference
To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
nvmath-python
NVIDIA Math Libraries for the Python Ecosystem
one-api
An OpenAI API management & distribution system supporting Azure, Anthropic Claude, Google PaLM 2 & Gemini, Zhipu ChatGLM, Baidu ERNIE Bot, iFlytek Spark, Alibaba Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan. Useful for secondary distribution and management of API keys; ships as a single executable with a prebuilt Docker image for one-click deployment, ready to use out of the box. Exposes a single API for all LLMs and features an English UI.
QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
sarathi-serve
A low-latency & high-throughput serving engine for LLMs
SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
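The core draft-then-verify idea behind these papers can be sketched in a few lines. This is a toy greedy variant, assuming `target_next` and `draft_next` are hypothetical stand-ins for a model's greedy next-token function; a real implementation scores all draft tokens in one batched forward pass rather than a Python loop.

```python
def speculative_decode_greedy(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding: a cheap draft model proposes k tokens,
    the target model verifies them. The output is identical to greedy
    decoding with the target alone; the speedup comes from verifying
    several draft tokens per target step."""
    seq = list(prompt)
    limit = len(prompt) + max_new
    while len(seq) < limit:
        # Draft proposes k tokens autoregressively.
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest agreeing prefix, then emit its
        # own token at the first disagreement (or one bonus token if every
        # draft token was accepted).
        ctx = list(seq)
        for t in proposal:
            want = target_next(ctx)
            ctx.append(want)
            if want != t:
                break
        else:
            ctx.append(target_next(ctx))
        seq = ctx[:limit]
    return seq
```

Because every emitted token is the target's own greedy choice, the method is lossless: a poor draft model only costs speed, never output quality.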
SpinQuant
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
ThunderKittens
Tile primitives for speedy kernels
tiny-gpu
A minimal GPU design in Verilog to learn how GPUs work from the ground up
triton-linalg
Development repository for the Triton-Linalg conversion
unsloth
Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
vidur
A large-scale simulation framework for LLM inference