DefTruth's repositories
CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels: ⚡️flash-attn-mma and ⚡️hgemm implemented with WMMA, MMA, and CuTe, reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉.
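For a rough sense of the cuBLAS baseline those percentages refer to: torch.matmul on fp16 CUDA tensors dispatches to cuBLAS, so a short PyTorch timing loop (shapes here are arbitrary) yields the reference TFLOPS the repo's WMMA/MMA/CuTe kernels are compared against.

```python
import torch

# Time an fp16 GEMM; torch.matmul on CUDA fp16 dispatches to cuBLAS,
# the baseline the repo's hand-written kernels are measured against.
M = N = K = 4096
a = torch.randn(M, K, dtype=torch.half, device="cuda")
b = torch.randn(K, N, dtype=torch.half, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10):              # warmup
    torch.matmul(a, b)
start.record()
iters = 100
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / iters
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12   # 2*M*N*K FLOPs per GEMM
print(f"{ms:.3f} ms, {tflops:.1f} TFLOPS")
```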
lite.ai.toolkit
🛠 A lite C++ toolkit: contains 100+ awesome AI models and supports MNN, NCNN, TNN, ONNXRuntime, and TensorRT. 🎉🎉
Awesome-LLM-Inference
📖A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, and more. 🎉🎉
ffpa-attn-mma
📚FFPA (Split-D): yet another faster flash prefill attention, with O(1) GPU SRAM complexity for headdim > 256; ~2x↑ 🎉 vs SDPA EA.
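FFPA's own kernels aren't shown here, but the "SDPA EA" baseline it is compared against can be reproduced with PyTorch's scaled_dot_product_attention at a head dim above 256 (shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

# The SDPA baseline FFPA targets: attention at headdim > 256, where
# FlashAttention-2 style kernels no longer fit their tiles in SRAM.
B, H, S, D = 1, 8, 4096, 320     # illustrative shapes, headdim > 256
q = torch.randn(B, H, S, D, dtype=torch.half, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)   # PyTorch picks a supported backend
print(out.shape)                                 # (1, 8, 4096, 320)
```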
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
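A minimal sketch of that Python LLM API, assuming a recent release; the model ID is a placeholder:

```python
from tensorrt_llm import LLM, SamplingParams

# Build a TensorRT engine for the model, then run inference on it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model ID
params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["What is paged attention?"], params):
    print(output.outputs[0].text)
```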
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
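A hedged sketch of post-training INT8 quantization with Model Optimizer, assuming the mtq.quantize entry point and the INT8_DEFAULT_CFG preset; the toy model and calibration data are stand-ins:

```python
import torch
import modelopt.torch.quantization as mtq

# Stand-in network and calibration set for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
calib_data = [torch.randn(8, 64) for _ in range(4)]

def forward_loop(m):
    # Calibration pass: run representative batches to collect activation ranges.
    for batch in calib_data:
        m(batch)

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```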
flash-attention
Fast and memory-efficient exact attention
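Usage is a single call; a minimal sketch with illustrative shapes (inputs must be fp16/bf16 CUDA tensors laid out as batch, seqlen, nheads, headdim):

```python
import torch
from flash_attn import flash_attn_func

# Drop-in exact attention; output has the same shape as q.
q = torch.randn(2, 1024, 8, 64, dtype=torch.half, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)   # (2, 1024, 8, 64)
```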
llm-action
This project shares the technical principles behind large language models, along with hands-on experience.
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
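A sketch of one-shot W4A16 GPTQ quantization along the lines of the project's examples; the model ID, dataset, and hyperparameters are placeholder choices, and on older releases oneshot is exported from llmcompressor.transformers instead:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot GPTQ: quantize all Linear layers to W4A16, keep lm_head in full precision.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # placeholder model ID
    dataset="open_platypus",                       # placeholder calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-W4A16",
)
```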
MInference
[NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
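A hedged sketch of the patching pattern from the project's README; the model name is a placeholder long-context checkpoint:

```python
from transformers import AutoModelForCausalLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"   # placeholder long-context model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Patch the attention modules with MInference's dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)
```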
Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and caching.
cache-dit
🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
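cache-dit's actual API isn't shown here; the sketch below is only a hypothetical toy illustrating the underlying training-free idea, reusing a block's output when its input barely changed between adjacent denoising steps:

```python
import torch

# Hypothetical toy (not cache-dit's API): wrap a transformer block and
# return its cached output when the new input is close to the previous one.
class CachedBlock(torch.nn.Module):
    def __init__(self, block, threshold=0.05):
        super().__init__()
        self.block, self.threshold = block, threshold
        self.prev_in = self.prev_out = None

    def forward(self, x):
        if self.prev_in is not None:
            change = (x - self.prev_in).norm() / self.prev_in.norm()
            if change < self.threshold:   # input nearly unchanged: cache hit
                return self.prev_out
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```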
chain-of-draft
Code and data for the Chain-of-Draft (CoD) paper
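The technique is a prompting strategy: ask the model for terse per-step drafts instead of verbose chain-of-thought. The system prompt below paraphrases the paper's instruction, and the OpenAI-style client call and model name are placeholder choices:

```python
from openai import OpenAI

# Chain-of-Draft style instruction: minimal drafts per reasoning step.
COD_SYSTEM = (
    "Think step by step, but only keep a minimum draft for each thinking "
    "step, with 5 words at most. Return the answer at the end of the "
    "response after a separator ####."
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",   # placeholder model
    messages=[
        {"role": "system", "content": COD_SYSTEM},
        {"role": "user", "content": "Jason had 20 lollipops. He gave Denny some. "
                                    "Now he has 12. How many did he give Denny?"},
    ],
)
print(resp.choices[0].message.content)
```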
CogVideo
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).
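A sketch of text-to-video via the diffusers CogVideoXPipeline, assuming the THUDM/CogVideoX-2b checkpoint; the prompt and frame count are illustrative:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b",
                                         torch_dtype=torch.bfloat16)
pipe.to("cuda")
video = pipe(prompt="A panda playing guitar in a bamboo forest",
             num_frames=49).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```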
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
ParaAttention
Context parallel attention that accelerates DiT model inference with dynamic caching
sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
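For example, the frontend DSL composes prompts and generation calls as ordinary Python functions; this sketch assumes a local SGLang server at the default endpoint:

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    # Build a chat turn, then generate a bounded answer into state key "answer".
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is flash attention?")
print(state["answer"])
```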
SpargeAttn
SpargeAttention: a training-free sparse attention that can accelerate inference for any model.
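SpargeAttn's actual algorithm and kernels aren't reproduced here; the toy below only illustrates the general idea of training-free block-sparse attention, selecting the most relevant key/value blocks per query block (the helper, block size, and keep ratio are hypothetical, and real kernels skip masked work rather than materializing a dense mask):

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """Toy: keep only the top-k key/value blocks per query block."""
    h, n, d = q.shape                           # assumes n % block == 0
    nb = n // block
    qb = q.view(h, nb, block, d).mean(dim=2)    # block-mean queries
    kb = k.view(h, nb, block, d).mean(dim=2)    # block-mean keys
    scores = qb @ kb.transpose(-1, -2)          # (h, nb, nb) block affinities
    keep = max(1, int(keep_ratio * nb))
    top = scores.topk(keep, dim=-1).indices
    mask = torch.zeros(h, nb, nb, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, top, True)                # mark selected blocks
    mask = mask.repeat_interleave(block, 1).repeat_interleave(block, 2)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = torch.randn(8, 1024, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
print(block_sparse_attention(q, k, v).shape)    # (8, 1024, 64)
```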
unlock-deepseek
Interpretations, extensions, and reproductions of the DeepSeek series of work.
xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism