LeeHX (ReKarma)

Company: ByteDance Inc

LeeHX's starred repositories

flash_attn_jax

JAX bindings for Flash Attention v2

Language: C++ · License: BSD-3-Clause · Stargazers: 71 · Issues: 0
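
For a sense of the interface, a minimal usage sketch follows. The import path, the `flash_mha` name, the `[batch, seqlen, heads, head_dim]` layout, and the `is_causal` flag are taken to match the project's README-style API, but treat them as assumptions rather than confirmed signatures.

```python
import jax
import jax.numpy as jnp
from flash_attn_jax import flash_mha  # assumed entry point, per the README

kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
shape = (2, 1024, 8, 64)              # [batch, seqlen, heads, head_dim]
q = jax.random.normal(kq, shape, dtype=jnp.float16)
k = jax.random.normal(kk, shape, dtype=jnp.float16)
v = jax.random.normal(kv, shape, dtype=jnp.float16)

out = flash_mha(q, k, v, is_causal=True)  # fused attention kernel on GPU
print(out.shape)                          # (2, 1024, 8, 64)
```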

tree_attention

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Language: Python · Stargazers: 79 · Issues: 0
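
The trick that makes attention reducible across devices, and hence amenable to a tree-shaped reduction, is that softmax attention over sharded keys/values can be computed per shard and then merged exactly through each shard's log-sum-exp. A single-process numpy sketch of that merge, with no distributed code:

```python
import numpy as np

def partial_attn(q, k, v):
    # Attention restricted to one KV shard, plus the shard's log-sum-exp.
    s = q @ k.T / np.sqrt(q.shape[-1])                 # [Tq, Tk] scores
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    lse = m + np.log(p.sum(axis=-1, keepdims=True))    # [Tq, 1]
    return (p @ v) / p.sum(axis=-1, keepdims=True), lse

def merge(o1, lse1, o2, lse2):
    # Exact, associative combination of two shards' partial results.
    lse = np.logaddexp(lse1, lse2)
    return o1 * np.exp(lse1 - lse) + o2 * np.exp(lse2 - lse), lse

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
k = rng.standard_normal((32, 16))
v = rng.standard_normal((32, 16))

o_a, lse_a = partial_attn(q, k[:16], v[:16])           # shard A
o_b, lse_b = partial_attn(q, k[16:], v[16:])           # shard B
o, _ = merge(o_a, lse_a, o_b, lse_b)

o_full, _ = partial_attn(q, k, v)                      # single-device reference
assert np.allclose(o, o_full)
```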

qwen.cpp

C++ implementation of Qwen-LM

Language: C++ · License: NOASSERTION · Stargazers: 528 · Issues: 0

Odysseus-Transformer

Odysseus: Playground of LLM Sequence Parallelism

Language: Python · Stargazers: 49 · Issues: 0

nccl-tests

NCCL Tests

Language: Cuda · License: BSD-3-Clause · Stargazers: 784 · Issues: 0
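
The harness's binaries (e.g. all_reduce_perf) sweep message sizes and report algorithm and bus bandwidth. For intuition, here is a rough PyTorch analogue of a single data point, not the nccl-tests harness itself; the 2(n-1)/n bus-bandwidth factor matches what nccl-tests reports for allreduce.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 bench_allreduce.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

n = 128 * 1024 * 1024 // 4              # 128 MiB of fp32
x = torch.ones(n, device="cuda")

for _ in range(5):                       # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

if rank == 0:
    world = dist.get_world_size()
    algbw = x.numel() * 4 / dt / 1e9         # GB/s of payload per rank
    busbw = algbw * 2 * (world - 1) / world  # allreduce bus-bandwidth factor
    print(f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s")
dist.destroy_process_group()
```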

long-context-attention

Sequence Parallel Attention for Long-Context LLM Training and Inference

Language: Python · Stargazers: 260 · Issues: 0
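
One of the schemes this repo unifies is Ulysses-style sequence parallelism, whose core move is an all-to-all that swaps a sequence-sharded/full-heads layout for a full-sequence/head-sharded one around attention. A single-process simulation of that re-partition, shapes only; no torch.distributed, and not this repo's API:

```python
import torch

world, seq, heads, dim = 4, 16, 8, 64
# Per-rank activations outside attention: [seq/world, heads, dim]
shards = [torch.randn(seq // world, heads, dim) for _ in range(world)]

# The all-to-all: rank r sends head-group g of its sequence slice to rank g
# and receives its own head group from every rank, ending with a
# [seq, heads/world, dim] tensor for attention.
attn_shards = [
    torch.cat(
        [s.view(seq // world, world, heads // world, dim)[:, g] for s in shards],
        dim=0,
    )
    for g in range(world)
]
assert attn_shards[0].shape == (seq, heads // world, dim)  # (16, 2, 64)
```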

grouped_gemm

PyTorch bindings for CUTLASS grouped GEMM.

Language: Cuda · License: Apache-2.0 · Stargazers: 45 · Issues: 0
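
The reference semantics of a grouped GEMM are just a loop of independent matmuls with ragged shapes; the point of the library is fusing that loop into a single CUTLASS kernel launch. A plain-PyTorch sketch of the semantics, not the library's API:

```python
import torch

def grouped_gemm_reference(a_list, b_list):
    # a_list[i]: [m_i, k], b_list[i]: [k, n] -> out[i]: [m_i, n]
    return [a @ b for a, b in zip(a_list, b_list)]

a_list = [torch.randn(m, 128) for m in (64, 96, 32)]  # ragged m per group,
b_list = [torch.randn(128, 256) for _ in range(3)]    # e.g. per-expert in MoE
outs = grouped_gemm_reference(a_list, b_list)
print([tuple(o.shape) for o in outs])  # [(64, 256), (96, 256), (32, 256)]
```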

Burst-Attention

Distributed IO-aware Attention algorithm

Language: Python · License: Apache-2.0 · Stargazers: 16 · Issues: 0

LivePortrait

Bring portraits to life!

Language: Python · License: NOASSERTION · Stargazers: 10037 · Issues: 0

Guide-NVIDIA-Tools

NVIDIA tools guide

Language: Cuda · Stargazers: 57 · Issues: 0

EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.

Language: Python · License: Apache-2.0 · Stargazers: 577 · Issues: 0

Adam-mini

Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793)

Language: Python · Stargazers: 253 · Issues: 0
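
An idea-level sketch of the paper's premise follows: one adaptive learning rate per parameter block rather than per coordinate, here treating each parameter tensor as a block. This is not the authors' implementation; the paper's actual block partitioning (e.g. by attention head) is more refined.

```python
import torch

class BlockwiseAdamSketch(torch.optim.Optimizer):
    """Adam variant with ONE second-moment scalar per block (here: per tensor)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                st = self.state[p]
                if not st:
                    st["t"] = 0
                    st["m"] = torch.zeros_like(p)               # per-coordinate momentum
                    st["v"] = torch.zeros((), device=p.device)  # one scalar per block
                st["t"] += 1
                st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                st["v"].mul_(b2).add_((1 - b2) * p.grad.pow(2).mean())
                m_hat = st["m"] / (1 - b1 ** st["t"])
                v_hat = st["v"] / (1 - b2 ** st["t"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])

# Usage: opt = BlockwiseAdamSketch(model.parameters(), lr=1e-3)
```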

matmulfreellm

Implementation for MatMul-free LM.

Language: Python · License: Apache-2.0 · Stargazers: 2805 · Issues: 0
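
The central device, per the paper, is replacing dense floating-point weights with ternary {-1, 0, +1} weights, so a "matmul" degenerates into additions and subtractions. A sketch of BitNet-b1.58-style absmean ternarization; the repo itself fuses this into custom kernels:

```python
import torch

def ternarize(w: torch.Tensor):
    # Absmean scaling, then round-to-nearest into {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(256, 256)
w_q, scale = ternarize(w)
x = torch.randn(8, 256)
y = (x @ w_q.t()) * scale             # "matmul" is now adds/subtracts of x
print(sorted(w_q.unique().tolist()))  # [-1.0, 0.0, 1.0]
```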

FlagGems

FlagGems is an operator library for large language models, implemented in the Triton language.

Language: Python · License: Apache-2.0 · Stargazers: 199 · Issues: 0
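
To illustrate what an operator in a Triton library looks like, here is the canonical Triton vector-add tutorial kernel; it is representative of the style, not code from FlagGems.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                       # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```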

xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

Language: Python · License: Apache-2.0 · Stargazers: 313 · Issues: 0

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python · License: NOASSERTION · Stargazers: 378 · Issues: 0

torchtitan

A native PyTorch library for large model training

Language: Python · License: BSD-3-Clause · Stargazers: 1473 · Issues: 0

gpu-optimization-workshop

Slides, notes, and materials for the GPU optimization workshop

Stargazers: 292 · Issues: 0

Awesome-Triton-Kernels

Collection of kernels written in the Triton language

License: MIT · Stargazers: 29 · Issues: 0

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stargazers: 374 · Issues: 0
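
The title's recipe is 4-bit weights, 8-bit activations, and a 4-bit KV cache; the repo's contribution is the system and kernel co-design around it. A sketch of just the symmetric group-wise INT4 weight quantization step:

```python
import torch

def quant_int4(w: torch.Tensor, group: int = 128):
    # Symmetric per-group quantization into the int4-friendly range [-7, 7].
    wg = w.reshape(-1, group)
    scale = wg.abs().amax(dim=1, keepdim=True) / 7
    q = (wg / scale).round().clamp(-7, 7)
    return q, scale

def dequant(q, scale, shape):
    return (q * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = quant_int4(w)
err = (dequant(q, s, w.shape) - w).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```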

ring-attention

ring-attention experiments

Language: Python · License: Apache-2.0 · Stargazers: 84 · Issues: 0

fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Language: Cuda · License: Apache-2.0 · Stargazers: 164 · Issues: 0

ring-flash-attention

Ring attention implementation with flash attention

Language: Python · Stargazers: 483 · Issues: 0
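
What the ring schedule computes: each rank keeps its query shard and streams key/value blocks around the ring, folding each block in with the same online-softmax update that flash attention uses. A single-process simulation over KV chunks; the actual repo overlaps this loop with point-to-point communication:

```python
import torch

def ring_attn_sim(q, kv_chunks, scale):
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
    l = torch.zeros(q.shape[0], 1)                  # running denominator
    acc = torch.zeros(q.shape[0], q.shape[1])       # running numerator
    for k, v in kv_chunks:                          # one chunk per ring step
        s = q @ k.t() * scale
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        alpha = (m - m_new).exp()                   # rescale old partials
        p = (s - m_new).exp()
        acc = acc * alpha + p @ v
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        m = m_new
    return acc / l

torch.manual_seed(0)
q = torch.randn(4, 32)
k = torch.randn(64, 32)
v = torch.randn(64, 32)
scale = 32 ** -0.5

chunks = [(k[i:i + 16], v[i:i + 16]) for i in range(0, 64, 16)]
out = ring_attn_sim(q, chunks, scale)
ref = torch.softmax(q @ k.t() * scale, dim=-1) @ v   # single-device reference
assert torch.allclose(out, ref, atol=1e-5)
```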

cuda-checkpoint

CUDA checkpoint and restore utility

Language: Cuda · License: NOASSERTION · Stargazers: 182 · Issues: 0

torch-cublas-hgemm

PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU

Language: Cuda · Stargazers: 20 · Issues: 0
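
The fused epilogue's reference semantics in plain PyTorch; several kernels here, where the library's point is a single cuBLAS call with the bias and activation folded in:

```python
import torch

a = torch.randn(128, 256, dtype=torch.half, device="cuda")
b = torch.randn(256, 512, dtype=torch.half, device="cuda")
bias = torch.randn(512, dtype=torch.half, device="cuda")

out = torch.relu(a @ b + bias)  # gemm, bias add, relu: three unfused ops
```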