ZZK (MARD1NO)


Company: SiliconFlow

Location: Neverland

Home Page: https://mard1no.github.io/


ZZK's repositories


flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda · License: Apache-2.0 · Stargazers: 1 · Issues: 0

gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.

Language: Python · License: BSD-3-Clause · Stargazers: 1 · Issues: 0

cutlass_master

CUDA Templates for Linear Algebra Subroutines

Language: C++ · License: NOASSERTION · Stargazers: 0 · Issues: 0

APPy

APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0

attorch

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0

auto-round

SOTA Weight-only Quantization Algorithm for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
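AutoRound's contribution is optimizing the rounding decision itself; the simplest weight-only baseline it improves on is plain round-to-nearest (RTN). The sketch below illustrates what "weight-only quantization" means in that baseline form, assuming symmetric per-tensor int4 — the function names are illustrative, not AutoRound's API.

```python
# Minimal round-to-nearest (RTN) weight-only int4 quantization sketch.
# This is the naive baseline, NOT AutoRound's learned-rounding algorithm.

def quantize_int4(weights):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0  # map max magnitude to 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int4 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.7, 0.33, 0.06, -0.21]
q, scale = quantize_int4(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
assert q == [1, -7, 3, 1, -2]
assert max_err < scale / 2  # RTN error is bounded by half a quantization step
```

Each weight is stored as a 4-bit code plus one shared fp scale, roughly a 4x memory saving over fp16; methods like AutoRound then tune the rounding direction per weight to cut the accuracy loss.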

BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Language: Python · License: MIT · Stargazers: 0 · Issues: 0

cccl

CUDA C++ Core Libraries

Language: C++ · License: NOASSERTION · Stargazers: 0 · Issues: 0

cudnn-frontend

cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it.

Language: C++ · License: MIT · Stargazers: 0 · Issues: 0

EETQ

Easy and Efficient Quantization for Transformers

Language: C++ · Stargazers: 0 · Issues: 0

float8_experimental

This repository contains the experimental PyTorch-native float8 training UX.

License: BSD-3-Clause · Stargazers: 0 · Issues: 0

fp6_llm

Efficient GPU support for LLM inference with 6-bit quantization (FP6).

License: Apache-2.0 · Stargazers: 0 · Issues: 0

gemma_pytorch

The official PyTorch implementation of Google's Gemma models

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

GPUSorting

OneSweep, implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.

Language: HLSL · License: NOASSERTION · Stargazers: 0 · Issues: 0

KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Language: Python · License: MIT · Stargazers: 0 · Issues: 0
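"Asymmetric" here means the quantization grid is shifted by a zero-point so it covers the actual min-max range of the values rather than being symmetric about zero. A minimal sketch of that idea at 2 bits — this illustrates the general min-max affine scheme, not KIVI's actual per-channel/per-token algorithm:

```python
# Minimal asymmetric (min-max affine) quantization sketch at 2 bits.
# Generic illustration only; KIVI's real method quantizes keys per-channel
# and values per-token.

def quantize_asym(xs, bits=2):
    """Map floats onto 2**bits uniform levels spanning [min(xs), max(xs)]."""
    lo, hi = min(xs), max(xs)
    levels = 2 ** bits - 1                  # 3 levels above zero for 2 bits
    scale = (hi - lo) / levels or 1.0       # guard against constant input
    q = [max(0, min(levels, round((x - lo) / scale))) for x in xs]
    return q, scale, lo

def dequantize_asym(q, scale, lo):
    return [v * scale + lo for v in q]

kv = [0.1, 0.42, 0.23, 0.55]               # toy stand-in for KV-cache entries
q, scale, lo = quantize_asym(kv)
approx = dequantize_asym(q, scale, lo)
assert q == [0, 2, 1, 3]                    # each entry fits in 2 bits
assert max(abs(a - b) for a, b in zip(kv, approx)) <= scale / 2
```

Storing each cached key/value element as a 2-bit code (plus a scale and offset per group) is what makes an 8x KV-cache memory reduction possible relative to fp16.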

KVQuant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Language: Python · Stargazers: 0 · Issues: 0

lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

LLMRoofline

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Language: Python · Stargazers: 0 · Issues: 0
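The roofline model this kind of comparison rests on reduces to a single min(): attainable throughput is capped by either peak compute or memory bandwidth times arithmetic intensity, whichever binds first. A minimal sketch, with placeholder hardware numbers that are not any particular GPU's specifications:

```python
# Roofline sketch: attainable throughput = min(peak compute,
# memory bandwidth * arithmetic intensity). Hardware numbers below are
# illustrative placeholders, not real GPU specs.

def attainable_tflops(peak_tflops, bw_tbps, flops, bytes_moved):
    intensity = flops / bytes_moved        # FLOPs per byte of traffic
    return min(peak_tflops, bw_tbps * intensity)

n = 4096
flops = 2 * n * n                          # one GEMV on an n x n matrix
bytes_moved = 2 * n * n                    # fp16 weight reads dominate traffic

# Batch-1 decode: intensity = 1 FLOP/byte -> memory-bound at bandwidth.
assert attainable_tflops(100.0, 1.0, flops, bytes_moved) == 1.0

# Batch 512 amortizes the weight reads: intensity = 512 -> compute-bound.
assert attainable_tflops(100.0, 1.0, flops * 512, bytes_moved) == 100.0
```

This is why LLM decode is bandwidth-limited on most hardware while prefill and large-batch serving approach peak FLOPs, and why comparing platforms requires both axes, not peak TFLOPs alone.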

open-gpu-kernel-modules

NVIDIA Linux open GPU kernel modules with P2P support

Language: C · License: NOASSERTION · Stargazers: 0 · Issues: 0

qllm-eval

Code repository for the paper "Evaluating Quantized Large Language Models".

Language: Python · License: MIT · Stargazers: 0 · Issues: 0

quanto

A PyTorch quantization toolkit.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

QUICK

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Language: Python · License: MIT · Stargazers: 0 · Issues: 0

tiny-gpu

A minimal GPU design in Verilog to learn how GPUs work from the ground up

Language: SystemVerilog · Stargazers: 0 · Issues: 0

triton

Development repository for the Triton language and compiler

Language: C++ · License: MIT · Stargazers: 0 · Issues: 0

Triton-Puzzles

Puzzles for learning Triton

License: Apache-2.0 · Stargazers: 0 · Issues: 0