DefTruth's repositories
CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels: ⚡️flash-attn-mma and ⚡️hgemm implemented with WMMA, MMA, and CuTe, reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉.
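For a rough sense of the cuBLAS baseline those percentages refer to: torch.matmul on fp16 CUDA tensors dispatches to cuBLAS, so a short PyTorch timing loop (shapes here are arbitrary) yields the reference TFLOPS the repo's WMMA/MMA/CuTe kernels are compared against.

```python
import torch

# Time an fp16 GEMM; torch.matmul on CUDA fp16 dispatches to cuBLAS,
# the baseline the repo's hand-written kernels are measured against.
M = N = K = 4096
a = torch.randn(M, K, dtype=torch.half, device="cuda")
b = torch.randn(K, N, dtype=torch.half, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10):              # warmup
    torch.matmul(a, b)
start.record()
iters = 100
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / iters
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12   # 2*M*N*K FLOPs per GEMM
print(f"{ms:.3f} ms, {tflops:.1f} TFLOPS")
```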
lite.ai.toolkit
🛠 A lite C++ toolkit: contains 100+ awesome AI models and supports MNN, NCNN, TNN, ONNXRuntime, and TensorRT. 🎉🎉
Awesome-LLM-Inference
📖A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, and more. 🎉🎉
ffpa-attn-mma
📚FFPA (Split-D): yet another faster flash prefill attention, with O(1) GPU SRAM complexity for headdim > 256; ~2x↑ 🎉 vs SDPA EA.
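FFPA's own kernels aren't shown here, but the "SDPA EA" baseline it is compared against can be reproduced with PyTorch's scaled_dot_product_attention at a head dim above 256 (shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

# The SDPA baseline FFPA targets: attention at headdim > 256, where
# FlashAttention-2 style kernels no longer fit their tiles in SRAM.
B, H, S, D = 1, 8, 4096, 320     # illustrative shapes, headdim > 256
q = torch.randn(B, H, S, D, dtype=torch.half, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)   # PyTorch picks a supported backend
print(out.shape)                                 # (1, 8, 4096, 320)
```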
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
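A minimal sketch of that Python LLM API, assuming a recent release; the model ID is a placeholder:

```python
from tensorrt_llm import LLM, SamplingParams

# Build a TensorRT engine for the model, then run inference on it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model ID
params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["What is paged attention?"], params):
    print(output.outputs[0].text)
```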
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
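A hedged sketch of post-training INT8 quantization with Model Optimizer, assuming the mtq.quantize entry point and the INT8_DEFAULT_CFG preset; the toy model and calibration data are stand-ins:

```python
import torch
import modelopt.torch.quantization as mtq

# Stand-in network and calibration set for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
calib_data = [torch.randn(8, 64) for _ in range(4)]

def forward_loop(m):
    # Calibration pass: run representative batches to collect activation ranges.
    for batch in calib_data:
        m(batch)

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```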
flash-attention
Fast and memory-efficient exact attention
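Usage is a single call; a minimal sketch with illustrative shapes (inputs must be fp16/bf16 CUDA tensors laid out as batch, seqlen, nheads, headdim):

```python
import torch
from flash_attn import flash_attn_func

# Drop-in exact attention; output has the same shape as q.
q = torch.randn(2, 1024, 8, 64, dtype=torch.half, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)   # (2, 1024, 8, 64)
```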
llm-action
This project shares the technical principles behind large language models, along with hands-on experience.
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
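A sketch of one-shot W4A16 GPTQ quantization along the lines of the project's examples; the model ID, dataset, and hyperparameters are placeholder choices, and on older releases oneshot is exported from llmcompressor.transformers instead:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot GPTQ: quantize all Linear layers to W4A16, keep lm_head in full precision.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # placeholder model ID
    dataset="open_platypus",                       # placeholder calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-W4A16",
)
```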
MInference
[NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
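A hedged sketch of the patching pattern from the project's README; the model name is a placeholder long-context checkpoint:

```python
from transformers import AutoModelForCausalLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"   # placeholder long-context model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Patch the attention modules with MInference's dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)
```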
Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and caching.
cache-dit
🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
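cache-dit's actual API isn't shown here; the sketch below is only a hypothetical toy illustrating the underlying training-free idea, reusing a block's output when its input barely changed between adjacent denoising steps:

```python
import torch

# Hypothetical toy (not cache-dit's API): wrap a transformer block and
# return its cached output when the new input is close to the previous one.
class CachedBlock(torch.nn.Module):
    def __init__(self, block, threshold=0.05):
        super().__init__()
        self.block, self.threshold = block, threshold
        self.prev_in = self.prev_out = None

    def forward(self, x):
        if self.prev_in is not None:
            change = (x - self.prev_in).norm() / self.prev_in.norm()
            if change < self.threshold:   # input nearly unchanged: cache hit
                return self.prev_out
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```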
chain-of-draft
Code and data for the Chain-of-Draft (CoD) paper
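The technique is a prompting strategy: ask the model for terse per-step drafts instead of verbose chain-of-thought. The system prompt below paraphrases the paper's instruction, and the OpenAI-style client call and model name are placeholder choices:

```python
from openai import OpenAI

# Chain-of-Draft style instruction: minimal drafts per reasoning step.
COD_SYSTEM = (
    "Think step by step, but only keep a minimum draft for each thinking "
    "step, with 5 words at most. Return the answer at the end of the "
    "response after a separator ####."
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",   # placeholder model
    messages=[
        {"role": "system", "content": COD_SYSTEM},
        {"role": "user", "content": "Jason had 20 lollipops. He gave Denny some. "
                                    "Now he has 12. How many did he give Denny?"},
    ],
)
print(resp.choices[0].message.content)
```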
CogVideo
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).
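A sketch of text-to-video via the diffusers CogVideoXPipeline, assuming the THUDM/CogVideoX-2b checkpoint; the prompt and frame count are illustrative:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b",
                                         torch_dtype=torch.bfloat16)
pipe.to("cuda")
video = pipe(prompt="A panda playing guitar in a bamboo forest",
             num_frames=49).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```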
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
ParaAttention
Context parallel attention that accelerates DiT model inference with dynamic caching
sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
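For example, the frontend DSL composes prompts and generation calls as ordinary Python functions; this sketch assumes a local SGLang server at the default endpoint:

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    # Build a chat turn, then generate a bounded answer into state key "answer".
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is flash attention?")
print(state["answer"])
```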
SpargeAttn
SpargeAttention: a training-free sparse attention that can accelerate inference for any model.
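SpargeAttn's actual algorithm and kernels aren't reproduced here; the toy below only illustrates the general idea of training-free block-sparse attention, selecting the most relevant key/value blocks per query block (the helper, block size, and keep ratio are hypothetical, and real kernels skip masked work rather than materializing a dense mask):

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """Toy: keep only the top-k key/value blocks per query block."""
    h, n, d = q.shape                           # assumes n % block == 0
    nb = n // block
    qb = q.view(h, nb, block, d).mean(dim=2)    # block-mean queries
    kb = k.view(h, nb, block, d).mean(dim=2)    # block-mean keys
    scores = qb @ kb.transpose(-1, -2)          # (h, nb, nb) block affinities
    keep = max(1, int(keep_ratio * nb))
    top = scores.topk(keep, dim=-1).indices
    mask = torch.zeros(h, nb, nb, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, top, True)                # mark selected blocks
    mask = mask.repeat_interleave(block, 1).repeat_interleave(block, 2)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = torch.randn(8, 1024, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
print(block_sparse_attention(q, k, v).shape)    # (8, 1024, 64)
```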
unlock-deepseek
Interpretations, extensions, and reproductions of the DeepSeek series of work.
xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism