MARD1NO

ZZK's repositories

gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

BSD-3-Clause | 100

cutlass_master

CUDA Templates for Linear Algebra Subroutines

Language: C++ | NOASSERTION | 000

APPy

APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.

MIT | 000

attorch

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Language: Python | MIT | 000

auto-round

SOTA Weight-only Quantization Algorithm for LLMs

Language: Python | Apache-2.0 | 000

BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

MIT | 000

CacheGen

Language: Python | 000

cccl

CUDA C++ Core Libraries

Language: C++ | NOASSERTION | 000

cudnn-frontend

cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it.

Language: C++ | MIT | 000

EETQ

Easy and Efficient Quantization for Transformers

Language: C++ | 000

faster-nougat

Implementation of nougat that focuses on processing PDFs locally.

Language: Python | 000

fp6_llm

Efficient GPU support for LLM inference with 6-bit quantization (FP6).

Apache-2.0 | 000

GPUSorting

OneSweep, implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.

NOASSERTION | 000

KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Language: Python | MIT | 000

KsanaLLM

NOASSERTION | 000

KVQuant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

000

lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

Language: Python | Apache-2.0 | 000

LLMRoofline

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Language: Python | 000

MARD1NO

010

MARD1NO.github.io

Language: HTML | 01 1

open-gpu-kernel-modules

NVIDIA Linux open GPU with P2P support

Language: C | NOASSERTION | 000

py-codegen

000

qllm-eval

Code repository for Evaluating Quantized Large Language Models

MIT | 000

quanto

A PyTorch quantization toolkit

Apache-2.0 | 000

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

NOASSERTION | 000

cute-gemm

ThunderKittens

tiny-gpu

triton

Triton-Puzzles