ZZK's repositories
fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
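As a reference for what the kernel computes: the Walsh-Hadamard transform applies the recursive butterfly H_{2n} = [[H_n, H_n], [H_n, -H_n]] in O(n log n). A minimal, unoptimized pure-PyTorch sketch for sanity-checking outputs (the function name is mine, not the repo's API):

```python
import torch

def hadamard_transform_ref(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized fast Walsh-Hadamard transform over the last dim (a power of two)."""
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    out = x.clone()
    h = 1
    while h < n:
        out = out.view(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2)  # one butterfly stage
        h *= 2
    return out.view(*x.shape)  # divide by sqrt(n) for the orthonormal variant
```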
flux
A fast communication-overlapping library for tensor parallelism on GPUs.
ktransformers
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
kvikio
KvikIO - High Performance File IO
LLM101n
LLM101n: Let's build a Storyteller
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
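For reference, the math such a kernel implements is dequantize-then-GEMM, fused so int4 weights are expanded to fp16 on the fly. A hypothetical, unoptimized PyTorch sketch; the byte packing and symmetric per-row scales here are my assumptions, not Marlin's actual layout:

```python
import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (uint8 in [0, 15]) into single bytes."""
    return (w_q[:, 0::2] | (w_q[:, 1::2] << 4)).to(torch.uint8)

def int4_matmul_ref(x: torch.Tensor, w_packed: torch.Tensor, scales: torch.Tensor):
    """y = x @ W^T with W dequantized from packed int4 (zero-point 8)."""
    lo = (w_packed & 0xF).to(torch.float32)
    hi = (w_packed >> 4).to(torch.float32)
    w = torch.stack((lo, hi), dim=-1).flatten(-2)  # undo the interleaved packing
    w = (w - 8.0) * scales                         # symmetric per-row dequant
    return (x.float() @ w.t()).to(x.dtype)         # fp32 accumulate for the reference
```

The real kernel fuses the dequantization into the GEMM so the weights stay 4-bit in memory; this sketch only pins down the arithmetic.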
MInference
To speed up inference for long-context LLMs, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
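A toy sketch of the dynamic-sparse idea (estimate per-block importance, keep only the top fraction of key blocks per query block); this is my simplification, and the paper's head-specific sparse patterns are more refined:

```python
import torch
import torch.nn.functional as F

def block_sparse_attention_ref(q, k, v, block=64, keep=0.25):
    """Toy dynamic block-sparse attention over (seq, dim) tensors."""
    n, d = q.shape
    assert n % block == 0
    scores = (q @ k.t()) / d ** 0.5  # toy: full scores; a real kernel skips masked blocks
    blk = scores.view(n // block, block, n // block, block).mean(dim=(1, 3))
    top = blk.topk(max(1, int(keep * blk.shape[-1])), dim=-1).indices
    mask = torch.zeros_like(blk, dtype=torch.bool).scatter_(-1, top, True)
    mask = mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v
```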
mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
nvmath-python
NVIDIA Math Libraries for the Python Ecosystem
one-api
OpenAI API management & distribution system supporting Azure, Anthropic Claude, Google PaLM 2 & Gemini, Zhipu ChatGLM, Baidu ERNIE Bot, iFlytek Spark, Alibaba Tongyi Qianwen, 360 Zhinao, and Tencent Hunyuan. It can redistribute and manage keys behind a single API for all LLMs, ships as a single executable with a prebuilt Docker image for one-click, out-of-the-box deployment, and features an English UI.
QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
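The identity that makes this work: for any orthogonal Q, Wx = (WQ)(Qᵀx), so weights and activations can be rotated (e.g. by a Hadamard matrix) to suppress outliers before 4-bit quantization without changing the network's output. A minimal sketch of the invariance (Sylvester construction; names are mine):

```python
import torch

def sylvester_hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat((torch.cat((h, h), 1), torch.cat((h, -h), 1)), 0)
    return h / n ** 0.5  # rows are orthonormal: h @ h.t() == I

n = 8
W, x = torch.randn(4, n), torch.randn(n)
Q = sylvester_hadamard(n)
# Rotations are computation-invariant: quantizing W @ Q and Q.t() @ x instead of
# W and x leaves the (unquantized) product unchanged.
assert torch.allclose((W @ Q) @ (Q.t() @ x), W @ x, atol=1e-5)
```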
sarathi-serve
A low-latency & high-throughput serving engine for LLMs
SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
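For orientation, the core loop these papers build on: a small draft model proposes k tokens autoregressively, the large target model verifies all of them in a single forward pass, and the longest agreeing prefix is kept. A toy greedy-verification sketch (function names are hypothetical; the stochastic variant instead accepts a draft token with probability min(1, p_target/p_draft)):

```python
import torch

def speculative_step(target_logits, draft_logits, prefix: torch.Tensor, k: int = 4):
    """One greedy speculative step. `*_logits(seq)` returns (len(seq), vocab) logits."""
    seq = prefix.clone()
    for _ in range(k):                   # cheap autoregressive drafting
        nxt = draft_logits(seq)[-1].argmax()
        seq = torch.cat((seq, nxt.view(1)))
    tgt = target_logits(seq).argmax(-1)  # one target pass scores every draft position
    n = prefix.numel()
    accepted = 0
    while accepted < k and tgt[n - 1 + accepted] == seq[n + accepted]:
        accepted += 1                    # keep drafts the target agrees with
    # Append the target's own next token after the accepted run (a "free" token).
    return torch.cat((seq[: n + accepted], tgt[n - 1 + accepted].view(1)))
```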
SpinQuant
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
triton-linalg
Development repository for the Triton-Linalg conversion
unsloth
Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
vidur
A large-scale simulation framework for LLM inference