DefTruth

User data from GitHub: https://github.com/DefTruth

Company: @xlite-dev, @vipshop

Location: Guangzhou, China

Home Page: https://github.com/xlite-dev

GitHub: @DefTruth


Organizations
PaddlePaddle
vipshop
xlite-dev

DefTruth's repositories

CUDA-Learn-Notes

📚200+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).

Language: Cuda · License: GPL-3.0 · Stargazers: 50 · Issues: 0

lite.ai.toolkit

🛠 A lite C++ toolkit containing 100+ awesome AI models, with support for MNN, NCNN, TNN, ONNXRuntime, and TensorRT. 🎉🎉

Language: C++ · License: GPL-3.0 · Stargazers: 21 · Issues: 0

Awesome-LLM-Inference

📖A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc. 🎉🎉

License: GPL-3.0 · Stargazers: 13 · Issues: 1
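
The WINT8/4 entries above refer to weight-only INT8/INT4 quantization. As a rough illustration (not code from any listed repo), a per-output-channel symmetric WINT8 quantizer can be sketched in PyTorch as follows; real inference kernels fuse the dequantization into the GEMM instead of materializing the dequantized weights:

```python
# Toy weight-only INT8 (WINT8) sketch: per-output-channel symmetric
# quantization of a linear layer's weights, dequantized at matmul time.
# Illustrative only; production kernels fuse dequant into the GEMM.
import torch

def quantize_wint8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def wint8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    w_hat = q.to(x.dtype) * scale.to(x.dtype)  # dequantize
    return x @ w_hat.t()

w, x = torch.randn(512, 512), torch.randn(8, 512)
q, s = quantize_wint8(w)
err = (wint8_linear(x, q, s) - x @ w.t()).abs().max()
print(f"max abs error: {err:.4f}")  # small relative to the output magnitude
```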

hgemm-mma

⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.

Language: Cuda · License: GPL-3.0 · Stargazers: 5 · Issues: 0
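
The "peak performance" these HGEMM repos advertise is measured in TFLOPS against cuBLAS. A minimal sketch of such a baseline measurement, assuming a CUDA GPU and using torch.matmul (which dispatches to cuBLAS for fp16) as the reference; a hand-written kernel would be timed the same way:

```python
# Time fp16 GEMM via torch.matmul (cuBLAS under the hood), report TFLOPS.
# Assumes a CUDA-capable GPU.
import torch

def hgemm_tflops(M=4096, N=4096, K=4096, iters=50):
    a = torch.randn(M, K, dtype=torch.float16, device="cuda")
    b = torch.randn(K, N, dtype=torch.float16, device="cuda")
    for _ in range(10):  # warmup
        _ = a @ b
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        _ = a @ b
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return 2 * M * N * K / (ms * 1e-3) / 1e12  # each GEMM is 2*M*N*K FLOPs

print(f"cuBLAS HGEMM baseline: {hgemm_tflops():.1f} TFLOPS")
```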

triton

Development repository for the Triton language and compiler

Language: C++ · License: MIT · Stargazers: 4 · Issues: 1

ffpa-attn-mma

📚FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA.

Language: Cuda · License: GPL-3.0 · Stargazers: 2 · Issues: 0

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stargazers: 2 · Issues: 1
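
A minimal sketch of the high-level Python LLM API the description refers to. The class names follow recent TensorRT-LLM releases and the model name is a placeholder; exact imports and arguments may differ across versions:

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API: build an engine from
# a Hugging Face checkpoint, then generate. Names may vary by version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["What does TensorRT-LLM optimize?"], params):
    print(out.outputs[0].text)
```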

TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python · License: NOASSERTION · Stargazers: 2 · Issues: 0

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 2 · Issues: 1
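
For comparison, vLLM's offline inference API is similarly compact; this sketch follows the project's quick-start pattern (the model name is just an example):

```python
# vLLM offline inference. PagedAttention and continuous batching are
# handled internally by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```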

cutlass

CUDA Templates for Linear Algebra Subroutines

Language: C++ · License: NOASSERTION · Stargazers: 1 · Issues: 1

flash-attention

Fast and memory-efficient exact attention

Language: Python · License: BSD-3-Clause · Stargazers: 1 · Issues: 1
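
A minimal sketch of calling the kernel directly via flash_attn_func; the (batch, seqlen, num_heads, head_dim) layout and the fp16/bf16-on-GPU requirement follow the library's documented interface:

```python
# Calling the flash-attention kernel directly: q/k/v are laid out as
# (batch, seqlen, num_heads, head_dim) in fp16/bf16 on a CUDA device.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # exact attention, tiled in SRAM
print(out.shape)  # torch.Size([2, 1024, 16, 64])
```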

FlashMLA

FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs

Language: C++ · License: MIT · Stargazers: 1 · Issues: 0

InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

Language: Python · License: MIT · Stargazers: 1 · Issues: 0

llm-action

This project shares the technical principles behind large language models, along with hands-on practical experience.

Language: HTML · License: Apache-2.0 · Stargazers: 1 · Issues: 0

llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Language: Python · License: Apache-2.0 · Stargazers: 1 · Issues: 0

MHA2MLA

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Language: Python · License: Apache-2.0 · Stargazers: 1 · Issues: 0

MInference

[NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Language: Python · License: MIT · Stargazers: 1 · Issues: 0
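
To make "approximate, dynamic sparsity" concrete, here is a toy block-sparse attention sketch. It is explicitly not MInference's algorithm, only the general idea: estimate block-level importance cheaply, then compute exact attention over the top-k key blocks for each query block:

```python
# Toy dynamic block-sparse attention (single head, non-causal, for clarity).
# Block importance is estimated from mean-pooled queries/keys; full attention
# is then computed only against the top-k key blocks per query block.
import torch
import torch.nn.functional as F

def blocksparse_attn(q, k, v, block=64, topk=4):
    n, d = q.shape                                  # q, k, v: [seqlen, head_dim]
    nb = n // block
    qb = q.view(nb, block, d).mean(dim=1)           # pooled query per block
    kb = k.view(nb, block, d).mean(dim=1)           # pooled key per block
    keep = (qb @ kb.t()).topk(topk, dim=-1).indices # top-k key blocks per q block
    out = torch.empty_like(q)
    for i in range(nb):
        ks = k.view(nb, block, d)[keep[i]].reshape(-1, d)  # gathered keys
        vs = v.view(nb, block, d)[keep[i]].reshape(-1, d)  # gathered values
        att = F.softmax(q[i*block:(i+1)*block] @ ks.t() / d**0.5, dim=-1)
        out[i*block:(i+1)*block] = att @ vs
    return out

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(blocksparse_attn(q, k, v).shape)  # torch.Size([1024, 64])
```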

Awesome-Video-Attention

A curated list of recent papers on efficient video attention for video diffusion models, covering sparsification, quantization, caching, etc.

Stargazers: 0 · Issues: 0

cache-dit

🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0

chain-of-draft

Code and data for the Chain-of-Draft (CoD) paper

Language: Python · Stargazers: 0 · Issues: 0

CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0

cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Language: Cuda · License: MIT · Stargazers: 0 · Issues: 0

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 1
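
A minimal sketch of LMDeploy's high-level pipeline API as shown in the project's quick start; the model name is a placeholder and the exact signature may vary across versions:

```python
# Hedged sketch of LMDeploy's pipeline API; the checkpoint name is a
# placeholder and argument names may differ by release.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")  # placeholder checkpoint
print(pipe(["Hi, please introduce yourself."]))
```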

ParaAttention

Context parallel attention that accelerates DiT model inference with dynamic caching

Language: Python · License: NOASSERTION · Stargazers: 0 · Issues: 0

sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
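
A minimal sketch of the SGLang frontend DSL the description refers to; it assumes an sglang server has already been launched locally (the endpoint URL and port are assumptions):

```python
# Hedged sketch of SGLang's frontend DSL: a program is a decorated function
# that interleaves prompt text with gen() calls, run against a local server.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is structured generation?")
print(state["answer"])
```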

SpargeAttn

SpargeAttention: training-free sparse attention that can accelerate inference for any model.

Language: Cuda · License: Apache-2.0 · Stargazers: 0 · Issues: 0

TensorRT

NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

Language: C++ · License: Apache-2.0 · Stargazers: 0 · Issues: 1

unlock-deepseek

Interpretation, extension, and reproduction of the DeepSeek series of works.

Language: Python · Stargazers: 0 · Issues: 0

xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0