Michael Goin's starred repositories
flash-attention
Fast and memory-efficient exact attention
Liger-Kernel
Efficient Triton Kernels for LLM Training
ThunderKittens
Tile primitives for speedy kernels
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks such as TensorRT-LLM or TensorRT to speed up inference on NVIDIA GPUs.
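The quantization these libraries apply can be illustrated with the simplest case: symmetric per-tensor int8 quantization, where each float weight is mapped to an 8-bit integer via a single scale. A minimal pure-Python sketch (the real libraries operate on framework tensors, not lists):

```python
def quantize_int8(weights):
    """Map float weights to int8 using a per-tensor symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)  # close to the original weights
```

The storage cost drops from 32 bits to 8 bits per weight, at the price of a small rounding error bounded by half the scale.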
composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
cold-compress
Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods, built on top of GPT-Fast, a simple PyTorch-native generation codebase.
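One of the simplest KV-cache eviction policies a toolkit like this can benchmark is keeping a few initial "attention sink" tokens plus a window of the most recent tokens, dropping everything in between. A hypothetical sketch (names here are illustrative, not Cold Compress's API):

```python
def compress_cache(cache, n_sink=2, window=4):
    """Evict middle entries from a per-token KV cache (oldest first).

    Keeps the first n_sink entries and the last `window` entries,
    discarding the rest once the cache exceeds n_sink + window.
    """
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = list(range(10))           # stand-in for 10 tokens' KV entries
compressed = compress_cache(cache)
# keeps entries 0, 1 (sinks) and 6, 7, 8, 9 (recent window)
```

The cache size is now bounded by `n_sink + window` regardless of sequence length, which is the memory saving such methods trade against accuracy.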
TensorRT-Incubator
Experimental projects related to TensorRT
Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
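The 2:4 ("two out of four") pattern means that in every group of four consecutive weights, at most two are nonzero, a structure NVIDIA GPUs can accelerate in hardware. A pure-Python sketch of magnitude pruning to this pattern (an illustration of the sparsity format, not of the Marlin kernel itself):

```python
def prune_2_4(weights):
    """Zero out the two smallest-magnitude weights in each group of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude weights in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]))[2:]
        out.extend(g if j in keep else 0.0 for j, g in enumerate(group))
    return out

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_2_4(w)
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the pattern is regular, a kernel only needs 2 values plus a small index per group, which is what makes the sparse matmul both compact and fast.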
compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
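The core idea behind bitmask-style sparse storage can be shown in a few lines: persist only the nonzero values plus a mask marking their positions. A pure-Python sketch (the real library packs the mask into bits and works on torch tensors serialized via safetensors):

```python
def compress(dense):
    """Split a dense list into a position mask and its nonzero values."""
    mask = [v != 0 for v in dense]
    values = [v for v in dense if v != 0]
    return mask, values

def decompress(mask, values):
    """Rebuild the dense list from the mask and nonzero values."""
    it = iter(values)
    return [next(it) if m else 0 for m in mask]

dense = [0, 3, 0, 0, 7, 0, 1, 0]
mask, values = compress(dense)     # values: [3, 7, 1]
restored = decompress(mask, values)
```

For a tensor that is mostly zeros, storing one bit per position plus the few nonzero values is far smaller than storing every element, which is the disk saving the extension targets.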