LeeHX's starred repositories
flash_attn_jax
JAX bindings for Flash Attention v2
tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters
Odysseus-Transformer
Odysseus: A Playground for LLM Sequence Parallelism
nccl-tests
NCCL Tests
long-context-attention
Sequence Parallel Attention for Long-Context LLM Training and Inference
grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM
Burst-Attention
A distributed, IO-aware attention algorithm
LivePortrait
Bring portraits to life!
Guide-NVIDIA-Tools
NVIDIA tools guide
EasyContext
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
matmulfreellm
Implementation of MatMul-free LM
TensorRT-Model-Optimizer
A unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT, optimizing inference speed on NVIDIA GPUs.
torchtitan
A native PyTorch library for large model training
gpu-optimization-workshop
Slides, notes, and materials for the GPU optimization workshop
Awesome-Triton-Kernels
A collection of kernels written in the Triton language
ring-attention
Ring attention experiments
ring-flash-attention
Ring attention implementation with flash attention
cuda-checkpoint
CUDA checkpoint and restore utility
torch-cublas-hgemm
PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU