WANG Zihan's starred repositories
flash-attention
Fast and memory-efficient exact attention
TensorRT-LLM
TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines containing state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that execute those engines.
llm-numbers
Numbers every LLM developer should know
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
how-to-optim-algorithm-in-cuda
How to optimize common algorithms in CUDA.
flashinfer
FlashInfer: Kernel Library for LLM Serving
How_to_optimize_in_GPU
A series of GPU optimization topics that introduces, in detail, how to optimize CUDA kernels. It walks through several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit. A minimal reduce-kernel sketch follows at the end of this list.
LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
compiler-and-arch
A list of tutorials, papers, talks, and open-source projects on emerging compilers and architectures
triton-shared
Shared Middle-Layer for Triton Compilation
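
To make the kind of optimization described in the How_to_optimize_in_GPU entry concrete, below is a minimal sketch of the "reduce" topic it lists: a block-level sum reduction built on warp shuffle intrinsics. This is not code from any repository above; the kernel name, launch configuration, and input size are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum via shuffle intrinsics: each step halves the number of
// participating lanes, so a 32-lane warp reduces in 5 steps with no
// shared-memory traffic or explicit synchronization.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level sum: reduce within each warp, stage one partial per warp in
// shared memory, then reduce those partials in the first warp.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float partials[32];                 // one slot per warp
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    // Grid-stride loop so a fixed grid covers any input size.
    float sum = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = warp_reduce_sum(sum);
    if (lane == 0) partials[warp] = sum;
    __syncthreads();

    if (warp == 0) {
        sum = (lane < blockDim.x / 32) ? partials[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(out, sum);        // combine block results
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    reduce_sum<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);  // prints 1048576

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The warp-shuffle step is typically the first optimization applied after a naive shared-memory tree reduce, since it removes intra-warp synchronization and shared-memory round trips; the grid-stride loop keeps each thread busy with multiple elements so the launch configuration can stay fixed regardless of input size.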