jeromeku's repositories
accelerated-scan
Accelerated First Order Parallel Associative Scan
ao
torchao: PyTorch Architecture Optimization (AO). A repository to host AO techniques and performant kernels that work with PyTorch.
api-design
LivingSocial API Design Guide
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
candle
Minimalist ML framework for Rust
colab-connect
Connect to Google Colab VM from your local VSCode
cookbook-dev
Deep learning for dummies. All the practical details and useful utilities that go into working with real models.
cutlass
CUDA Templates for Linear Algebra Subroutines
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
EVT_AE
Artifacts of EVT ASPLOS'24
FlagAttention
A collection of memory-efficient attention operators implemented in the Triton language.
fsdp_qlora
Training LLMs with QLoRA + FSDP
GEMM_MMA
Optimize GEMM with Tensor Cores step by step
gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
long-context-attention
USP: Hybrid Sequence Parallel Attention for Long Context Transformers Model Training and Inference
punica
Serving multiple LoRA-finetuned LLMs as one
sc23-dl-tutorial
SC23 Deep Learning at Scale Tutorial Material
stable-fast
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
torchtune
A Native-PyTorch Library for LLM Fine-tuning
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
triton
Development repository for the Triton language and compiler
unsloth
5x faster, 60% less memory QLoRA finetuning