ZZK's repositories
fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
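As a reference for what the kernel computes: the Walsh-Hadamard transform applies the recursive butterfly H_{2n} = [[H_n, H_n], [H_n, -H_n]] in O(n log n). A minimal, unoptimized pure-PyTorch sketch for sanity-checking outputs (the function name is mine, not the repo's API):

```python
import torch

def hadamard_transform_ref(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized fast Walsh-Hadamard transform over the last dim (a power of two)."""
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    out = x.clone()
    h = 1
    while h < n:
        out = out.view(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2)  # one butterfly stage
        h *= 2
    return out.view(*x.shape)  # divide by sqrt(n) for the orthonormal variant
```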
flux
A fast communication-overlapping library for tensor parallelism on GPUs.
ktransformers
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
kvikio
KvikIO - High Performance File IO
LLM101n
LLM101n: Let's build a Storyteller
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
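For reference, the math such a kernel implements is dequantize-then-GEMM, fused so int4 weights are expanded to fp16 on the fly. A hypothetical, unoptimized PyTorch sketch; the byte packing and symmetric per-row scales here are my assumptions, not Marlin's actual layout:

```python
import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (uint8 in [0, 15]) into single bytes."""
    return (w_q[:, 0::2] | (w_q[:, 1::2] << 4)).to(torch.uint8)

def int4_matmul_ref(x: torch.Tensor, w_packed: torch.Tensor, scales: torch.Tensor):
    """y = x @ W^T with W dequantized from packed int4 (zero-point 8)."""
    lo = (w_packed & 0xF).to(torch.float32)
    hi = (w_packed >> 4).to(torch.float32)
    w = torch.stack((lo, hi), dim=-1).flatten(-2)  # undo the interleaved packing
    w = (w - 8.0) * scales                         # symmetric per-row dequant
    return (x.float() @ w.t()).to(x.dtype)         # fp32 accumulate for the reference
```

The real kernel fuses the dequantization into the GEMM so the weights stay 4-bit in memory; this sketch only pins down the arithmetic.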
MInference
To speed up inference for long-context LLMs, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
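A toy sketch of the dynamic-sparse idea (estimate per-block importance, keep only the top fraction of key blocks per query block); this is my simplification, and the paper's head-specific sparse patterns are more refined:

```python
import torch
import torch.nn.functional as F

def block_sparse_attention_ref(q, k, v, block=64, keep=0.25):
    """Toy dynamic block-sparse attention over (seq, dim) tensors."""
    n, d = q.shape
    assert n % block == 0
    scores = (q @ k.t()) / d ** 0.5  # toy: full scores; a real kernel skips masked blocks
    blk = scores.view(n // block, block, n // block, block).mean(dim=(1, 3))
    top = blk.topk(max(1, int(keep * blk.shape[-1])), dim=-1).indices
    mask = torch.zeros_like(blk, dtype=torch.bool).scatter_(-1, top, True)
    mask = mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v
```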
mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
nvmath-python
NVIDIA Math Libraries for the Python Ecosystem
one-api
OpenAI API management & distribution system supporting Azure, Anthropic Claude, Google PaLM 2 & Gemini, Zhipu ChatGLM, Baidu ERNIE Bot, iFlytek Spark, Alibaba Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan. It can redistribute and manage keys behind a single API for all LLMs, ships as a single executable with a prebuilt Docker image for one-click, out-of-the-box deployment, and features an English UI.
QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
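The identity that makes this work: for any orthogonal Q, Wx = (WQ)(Qᵀx), so weights and activations can be rotated (e.g. by a Hadamard matrix) to suppress outliers before 4-bit quantization without changing the network's output. A minimal sketch of the invariance (Sylvester construction; names are mine):

```python
import torch

def sylvester_hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat((torch.cat((h, h), 1), torch.cat((h, -h), 1)), 0)
    return h / n ** 0.5  # rows are orthonormal: h @ h.t() == I

n = 8
W, x = torch.randn(4, n), torch.randn(n)
Q = sylvester_hadamard(n)
# Rotations are computation-invariant: quantizing W @ Q and Q.t() @ x instead of
# W and x leaves the (unquantized) product unchanged.
assert torch.allclose((W @ Q) @ (Q.t() @ x), W @ x, atol=1e-5)
```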
sarathi-serve
A low-latency & high-throughput serving engine for LLMs
SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
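For orientation, the core loop these papers build on: a small draft model proposes k tokens autoregressively, the large target model verifies all of them in a single forward pass, and the longest agreeing prefix is kept. A toy greedy-verification sketch (function names are hypothetical; the stochastic variant instead accepts a draft token with probability min(1, p_target/p_draft)):

```python
import torch

def speculative_step(target_logits, draft_logits, prefix: torch.Tensor, k: int = 4):
    """One greedy speculative step. `*_logits(seq)` returns (len(seq), vocab) logits."""
    seq = prefix.clone()
    for _ in range(k):                   # cheap autoregressive drafting
        nxt = draft_logits(seq)[-1].argmax()
        seq = torch.cat((seq, nxt.view(1)))
    tgt = target_logits(seq).argmax(-1)  # one target pass scores every draft position
    n = prefix.numel()
    accepted = 0
    while accepted < k and tgt[n - 1 + accepted] == seq[n + accepted]:
        accepted += 1                    # keep drafts the target agrees with
    # Append the target's own next token after the accepted run (a "free" token).
    return torch.cat((seq[: n + accepted], tgt[n - 1 + accepted].view(1)))
```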
SpinQuant
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
triton-linalg
Development repository for the Triton-Linalg conversion
unsloth
Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
vidur
A large-scale simulation framework for LLM inference