DefTruth's repositories
lite.ai.toolkit
🛠 A lite C++ toolkit of awesome AI models with ONNXRuntime and MNN backends. Contains YOLOv5, YOLOv6, YOLOX, YOLOR, FaceDet, HeadSeg, HeadPose, Matting, etc.
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
CUDA-Learn-Note
🎉 CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
statistic-learning-R-note
📒 Study notes for Statistical Learning Methods by Li Hang: a 200-page PDF with detailed hand-derived formulas, a full table of contents, and R implementations. Best used alongside the book; suitable for beginners in machine learning and deep learning.
flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
Awesome-SD-Inference
Awesome Stable Diffusion Inference
torch-tensorrt
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
flash-attention
Fast and memory-efficient exact attention
LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.
TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
DeepCache
DeepCache: Accelerating Diffusion Models for Free
DiT
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
flash-linear-attention
Fast implementations of causal linear attention for autoregressive language modeling (PyTorch)
flashinfer
FlashInfer: Kernel Library for LLM Serving
LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
stable-fast
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
tensorrtllm_backend
The Triton TensorRT-LLM Backend
triton
Development repository for the Triton language and compiler
xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.