WANG Zihan's starred repositories
flash-attention
Fast and memory-efficient exact attention
TensorRT-LLM
TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines containing state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that execute those engines.
llm-numbers
Numbers every LLM developer should know
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
how-to-optim-algorithm-in-cuda
How to optimize common algorithms in CUDA.
flashinfer
FlashInfer: Kernel Library for LLM Serving
How_to_optimize_in_GPU
A series of GPU optimization topics that introduces, in detail, how to optimize CUDA kernels. It walks through several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit. A minimal reduce-kernel sketch follows at the end of this list.
LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
compiler-and-arch
A list of tutorials, papers, talks, and open-source projects on emerging compilers and architectures
triton-shared
Shared Middle-Layer for Triton Compilation
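
To make the kind of optimization described in the How_to_optimize_in_GPU entry concrete, below is a minimal sketch of the "reduce" topic it lists: a block-level sum reduction built on warp shuffle intrinsics. This is not code from any repository above; the kernel name, launch configuration, and input size are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum via shuffle intrinsics: each step halves the number of
// participating lanes, so a 32-lane warp reduces in 5 steps with no
// shared-memory traffic or explicit synchronization.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level sum: reduce within each warp, stage one partial per warp in
// shared memory, then reduce those partials in the first warp.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float partials[32];                 // one slot per warp
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    // Grid-stride loop so a fixed grid covers any input size.
    float sum = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = warp_reduce_sum(sum);
    if (lane == 0) partials[warp] = sum;
    __syncthreads();

    if (warp == 0) {
        sum = (lane < blockDim.x / 32) ? partials[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(out, sum);        // combine block results
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    reduce_sum<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);  // prints 1048576

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The warp-shuffle step is typically the first optimization applied after a naive shared-memory tree reduce, since it removes intra-warp synchronization and shared-memory round trips; the grid-stride loop keeps each thread busy with multiple elements so the launch configuration can stay fixed regardless of input size.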