LeeHX (ReKarma)

Company: ByteDance Inc

LeeHX's starred repositories

flash_attn_jax

JAX bindings for Flash Attention v2

Language: C++ · License: BSD-3-Clause · Stargazers: 71 · Issues: 0
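
For a sense of the interface, a minimal usage sketch follows. The import path, the `flash_mha` name, the `[batch, seqlen, heads, head_dim]` layout, and the `is_causal` flag are taken to match the project's README-style API, but treat them as assumptions rather than confirmed signatures.

```python
import jax
import jax.numpy as jnp
from flash_attn_jax import flash_mha  # assumed entry point, per the README

kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
shape = (2, 1024, 8, 64)              # [batch, seqlen, heads, head_dim]
q = jax.random.normal(kq, shape, dtype=jnp.float16)
k = jax.random.normal(kk, shape, dtype=jnp.float16)
v = jax.random.normal(kv, shape, dtype=jnp.float16)

out = flash_mha(q, k, v, is_causal=True)  # fused attention kernel on GPU
print(out.shape)                          # (2, 1024, 8, 64)
```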

tree_attention

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Language: Python · Stargazers: 79 · Issues: 0
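
The trick that makes attention reducible across devices, and hence amenable to a tree-shaped reduction, is that softmax attention over sharded keys/values can be computed per shard and then merged exactly through each shard's log-sum-exp. A single-process numpy sketch of that merge, with no distributed code:

```python
import numpy as np

def partial_attn(q, k, v):
    # Attention restricted to one KV shard, plus the shard's log-sum-exp.
    s = q @ k.T / np.sqrt(q.shape[-1])                 # [Tq, Tk] scores
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    lse = m + np.log(p.sum(axis=-1, keepdims=True))    # [Tq, 1]
    return (p @ v) / p.sum(axis=-1, keepdims=True), lse

def merge(o1, lse1, o2, lse2):
    # Exact, associative combination of two shards' partial results.
    lse = np.logaddexp(lse1, lse2)
    return o1 * np.exp(lse1 - lse) + o2 * np.exp(lse2 - lse), lse

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
k = rng.standard_normal((32, 16))
v = rng.standard_normal((32, 16))

o_a, lse_a = partial_attn(q, k[:16], v[:16])           # shard A
o_b, lse_b = partial_attn(q, k[16:], v[16:])           # shard B
o, _ = merge(o_a, lse_a, o_b, lse_b)

o_full, _ = partial_attn(q, k, v)                      # single-device reference
assert np.allclose(o, o_full)
```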

qwen.cpp

C++ implementation of Qwen-LM

Language: C++ · License: NOASSERTION · Stargazers: 528 · Issues: 0

Odysseus-Transformer

Odysseus: Playground of LLM Sequence Parallelism

Language: Python · Stargazers: 49 · Issues: 0

nccl-tests

NCCL Tests

Language: Cuda · License: BSD-3-Clause · Stargazers: 784 · Issues: 0
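
The harness's binaries (e.g. all_reduce_perf) sweep message sizes and report algorithm and bus bandwidth. For intuition, here is a rough PyTorch analogue of a single data point, not the nccl-tests harness itself; the 2(n-1)/n bus-bandwidth factor matches what nccl-tests reports for allreduce.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 bench_allreduce.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

n = 128 * 1024 * 1024 // 4              # 128 MiB of fp32
x = torch.ones(n, device="cuda")

for _ in range(5):                       # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

if rank == 0:
    world = dist.get_world_size()
    algbw = x.numel() * 4 / dt / 1e9         # GB/s of payload per rank
    busbw = algbw * 2 * (world - 1) / world  # allreduce bus-bandwidth factor
    print(f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s")
dist.destroy_process_group()
```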

long-context-attention

Sequence Parallel Attention for Long-Context LLM Training and Inference

Language: Python · Stargazers: 260 · Issues: 0
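
One of the schemes this repo unifies is Ulysses-style sequence parallelism, whose core move is an all-to-all that swaps a sequence-sharded/full-heads layout for a full-sequence/head-sharded one around attention. A single-process simulation of that re-partition, shapes only; no torch.distributed, and not this repo's API:

```python
import torch

world, seq, heads, dim = 4, 16, 8, 64
# Per-rank activations outside attention: [seq/world, heads, dim]
shards = [torch.randn(seq // world, heads, dim) for _ in range(world)]

# The all-to-all: rank r sends head-group g of its sequence slice to rank g
# and receives its own head group from every rank, ending with a
# [seq, heads/world, dim] tensor for attention.
attn_shards = [
    torch.cat(
        [s.view(seq // world, world, heads // world, dim)[:, g] for s in shards],
        dim=0,
    )
    for g in range(world)
]
assert attn_shards[0].shape == (seq, heads // world, dim)  # (16, 2, 64)
```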

grouped_gemm

PyTorch bindings for CUTLASS grouped GEMM.

Language: Cuda · License: Apache-2.0 · Stargazers: 45 · Issues: 0
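
The reference semantics of a grouped GEMM are just a loop of independent matmuls with ragged shapes; the point of the library is fusing that loop into a single CUTLASS kernel launch. A plain-PyTorch sketch of the semantics, not the library's API:

```python
import torch

def grouped_gemm_reference(a_list, b_list):
    # a_list[i]: [m_i, k], b_list[i]: [k, n] -> out[i]: [m_i, n]
    return [a @ b for a, b in zip(a_list, b_list)]

a_list = [torch.randn(m, 128) for m in (64, 96, 32)]  # ragged m per group,
b_list = [torch.randn(128, 256) for _ in range(3)]    # e.g. per-expert in MoE
outs = grouped_gemm_reference(a_list, b_list)
print([tuple(o.shape) for o in outs])  # [(64, 256), (96, 256), (32, 256)]
```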

Burst-Attention

Distributed IO-aware Attention algorithm

Language: Python · License: Apache-2.0 · Stargazers: 16 · Issues: 0

LivePortrait

Bring portraits to life!

Language: Python · License: NOASSERTION · Stargazers: 10037 · Issues: 0

Guide-NVIDIA-Tools

NVIDIA tools guide

Language: Cuda · Stargazers: 57 · Issues: 0

EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.

Language: Python · License: Apache-2.0 · Stargazers: 577 · Issues: 0

Adam-mini

Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793)

Language: Python · Stargazers: 253 · Issues: 0
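
An idea-level sketch of the paper's premise follows: one adaptive learning rate per parameter block rather than per coordinate, here treating each parameter tensor as a block. This is not the authors' implementation; the paper's actual block partitioning (e.g. by attention head) is more refined.

```python
import torch

class BlockwiseAdamSketch(torch.optim.Optimizer):
    """Adam variant with ONE second-moment scalar per block (here: per tensor)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                st = self.state[p]
                if not st:
                    st["t"] = 0
                    st["m"] = torch.zeros_like(p)               # per-coordinate momentum
                    st["v"] = torch.zeros((), device=p.device)  # one scalar per block
                st["t"] += 1
                st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                st["v"].mul_(b2).add_((1 - b2) * p.grad.pow(2).mean())
                m_hat = st["m"] / (1 - b1 ** st["t"])
                v_hat = st["v"] / (1 - b2 ** st["t"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])

# Usage: opt = BlockwiseAdamSketch(model.parameters(), lr=1e-3)
```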

matmulfreellm

Implementation for MatMul-free LM.

Language: Python · License: Apache-2.0 · Stargazers: 2805 · Issues: 0
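
The central device, per the paper, is replacing dense floating-point weights with ternary {-1, 0, +1} weights, so a "matmul" degenerates into additions and subtractions. A sketch of BitNet-b1.58-style absmean ternarization; the repo itself fuses this into custom kernels:

```python
import torch

def ternarize(w: torch.Tensor):
    # Absmean scaling, then round-to-nearest into {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(256, 256)
w_q, scale = ternarize(w)
x = torch.randn(8, 256)
y = (x @ w_q.t()) * scale             # "matmul" is now adds/subtracts of x
print(sorted(w_q.unique().tolist()))  # [-1.0, 0.0, 1.0]
```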

FlagGems

FlagGems is an operator library for large language models, implemented in the Triton language.

Language: Python · License: Apache-2.0 · Stargazers: 199 · Issues: 0
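
To illustrate what an operator in a Triton library looks like, here is the canonical Triton vector-add tutorial kernel; it is representative of the style, not code from FlagGems.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                       # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```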

xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

Language: Python · License: Apache-2.0 · Stargazers: 313 · Issues: 0

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python · License: NOASSERTION · Stargazers: 378 · Issues: 0

torchtitan

A native PyTorch library for large model training

Language: Python · License: BSD-3-Clause · Stargazers: 1473 · Issues: 0

gpu-optimization-workshop

Slides, notes, and materials for the GPU optimization workshop

Stargazers: 292 · Issues: 0

Awesome-Triton-Kernels

Collection of kernels written in the Triton language

License: MIT · Stargazers: 29 · Issues: 0

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stargazers: 374 · Issues: 0
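
The title's recipe is 4-bit weights, 8-bit activations, and a 4-bit KV cache; the repo's contribution is the system and kernel co-design around it. A sketch of just the symmetric group-wise INT4 weight quantization step:

```python
import torch

def quant_int4(w: torch.Tensor, group: int = 128):
    # Symmetric per-group quantization into the int4-friendly range [-7, 7].
    wg = w.reshape(-1, group)
    scale = wg.abs().amax(dim=1, keepdim=True) / 7
    q = (wg / scale).round().clamp(-7, 7)
    return q, scale

def dequant(q, scale, shape):
    return (q * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = quant_int4(w)
err = (dequant(q, s, w.shape) - w).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```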

ring-attention

ring-attention experiments

Language: Python · License: Apache-2.0 · Stargazers: 84 · Issues: 0

fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Language: Cuda · License: Apache-2.0 · Stargazers: 164 · Issues: 0

ring-flash-attention

Ring attention implementation with flash attention

Language: Python · Stargazers: 483 · Issues: 0
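
What the ring schedule computes: each rank keeps its query shard and streams key/value blocks around the ring, folding each block in with the same online-softmax update that flash attention uses. A single-process simulation over KV chunks; the actual repo overlaps this loop with point-to-point communication:

```python
import torch

def ring_attn_sim(q, kv_chunks, scale):
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
    l = torch.zeros(q.shape[0], 1)                  # running denominator
    acc = torch.zeros(q.shape[0], q.shape[1])       # running numerator
    for k, v in kv_chunks:                          # one chunk per ring step
        s = q @ k.t() * scale
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        alpha = (m - m_new).exp()                   # rescale old partials
        p = (s - m_new).exp()
        acc = acc * alpha + p @ v
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        m = m_new
    return acc / l

torch.manual_seed(0)
q = torch.randn(4, 32)
k = torch.randn(64, 32)
v = torch.randn(64, 32)
scale = 32 ** -0.5

chunks = [(k[i:i + 16], v[i:i + 16]) for i in range(0, 64, 16)]
out = ring_attn_sim(q, chunks, scale)
ref = torch.softmax(q @ k.t() * scale, dim=-1) @ v   # single-device reference
assert torch.allclose(out, ref, atol=1e-5)
```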

cuda-checkpoint

CUDA checkpoint and restore utility

Language: Cuda · License: NOASSERTION · Stargazers: 182 · Issues: 0

torch-cublas-hgemm

PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU

Language: Cuda · Stargazers: 20 · Issues: 0
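
The fused epilogue's reference semantics in plain PyTorch; several kernels here, where the library's point is a single cuBLAS call with the bias and activation folded in:

```python
import torch

a = torch.randn(128, 256, dtype=torch.half, device="cuda")
b = torch.randn(256, 512, dtype=torch.half, device="cuda")
bias = torch.randn(512, dtype=torch.half, device="cuda")

out = torch.relu(a @ b + bias)  # gemm, bias add, relu: three unfused ops
```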