Beast code in Giters

jeromeku's repositories

triton-rs

Language: Rust · 9 10

unpack_int4

1 10

accelerated-scan

Accelerated First Order Parallel Associative Scan

Language: Cuda · License: MIT · 0 0 0

ao

torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.

Language: Python · License: BSD-3-Clause · 0 0 0

api-design

LivingSocial API Design Guide

0 0 0

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Language: Python · License: MIT · 0 0 0

candle

Minimalist ML framework for Rust

Language: Rust · License: Apache-2.0 · 0 0 0

colab-connect

Connect to Google Colab VM from your local VSCode

Language: Python · License: MIT · 0 0 0

colab-test

Language: Jupyter Notebook · 0 1 0

cookbook-dev

Deep learning for dummies. All the practical details and useful utilities that go into working with real models.

Language: Python · License: Apache-2.0 · 0 0 0

cutlass

CUDA Templates for Linear Algebra Subroutines

Language: C++ · License: NOASSERTION · 0 0 0

CutlassProgramming

Language: Cuda · 0 0 0

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

License: Apache-2.0 · 0 0 0

EVT_AE

Artifacts of EVT ASPLOS'24

Language: Python · 0 0 0

extension_builder

Language: Cuda · 0 0 0

FlagAttention

A collection of memory efficient attention operators implemented in the Triton language.

Language: Python · License: NOASSERTION · 0 0 0

fsdp_qlora

Training LLMs with QLoRA + FSDP

Language: Jupyter Notebook · License: Apache-2.0 · 0 0 0

GaLore

Language: Python · License: Apache-2.0 · 0 0 0

GEMM_MMA

Optimize GEMM with Tensor Cores, step by step

0 0 0

gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

License: BSD-3-Clause · 0 0 0

long-context-attention

USP: Hybrid Sequence Parallel Attention for Long Context Transformers Model Training and Inference

License: Apache-2.0 · 0 0 0

punica

Serving multiple LoRA-finetuned LLMs as one

Language: Cuda · 0 0 0

pybind_example

Language: Python · License: NOASSERTION · 0 1 0

sc23-dl-tutorial

SC23 Deep Learning at Scale Tutorial Material

Language: Python · 0 0 0

stable-fast

Best inference performance optimization framework for Hugging Face Diffusers on NVIDIA GPUs.

Language: Python · License: MIT · 0 0 0

torchtune

A Native-PyTorch Library for LLM Fine-tuning

License: BSD-3-Clause · 0 0 0

transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Language: Python · License: Apache-2.0 · 0 0 0

triton

Development repository for the Triton language and compiler

Language: C++ · License: MIT · 0 0 0

triton-aot

Language: C++ · License: MIT · 0 1 0

unsloth

5x faster, 60% less memory QLoRA finetuning

Language: Python · License: Apache-2.0 · 0 0 0