WANG Zihan (wzh99)

Company: ByteDance

Location: Shanghai, China

Organizations
SJTU-CSE

WANG Zihan's starred repositories

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 27108 · Issues: 226 · Issues: 4522
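
As a hedged illustration of the serving API this entry describes, below is a minimal offline-inference sketch using vLLM's documented `LLM`/`SamplingParams` interface; the model name is only an example.

```python
# Minimal offline-inference sketch with vLLM (model name is just an example).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)  # generated continuation for the first prompt
```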

mlc-llm

Universal LLM Deployment Engine with ML Compilation

Language: Python · License: Apache-2.0 · Stargazers: 18682 · Issues: 170 · Issues: 1356

flash-attention

Fast and memory-efficient exact attention

Language: Python · License: BSD-3-Clause · Stargazers: 13466 · Issues: 114 · Issues: 1034
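
A minimal sketch of calling the repo's `flash_attn_func`, assuming a CUDA GPU and fp16 tensors; shapes follow the documented (batch, seqlen, nheads, headdim) layout.

```python
# Exact attention via flash_attn_func; requires a CUDA GPU and fp16/bf16 inputs.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
out = flash_attn_func(q, k, v, causal=True)  # (2, 1024, 16, 64), causal mask applied
```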

mamba

Mamba SSM architecture

Language: Python · License: Apache-2.0 · Stargazers: 12624 · Issues: 102 · Issues: 504
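
A minimal sketch of the stand-alone `Mamba` block from the accompanying `mamba_ssm` package, following the shapes shown in the repo README; assumes a CUDA GPU.

```python
# Stand-alone Mamba block from the mamba_ssm package; requires a CUDA GPU.
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim, device="cuda")
block = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2).to("cuda")
y = block(x)  # output has the same (batch, length, dim) shape as the input
assert y.shape == x.shape
```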

LLMSurvey

The official GitHub page for the survey paper "A Survey of Large Language Models".

FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.

Language: Python · License: Apache-2.0 · Stargazers: 9137 · Issues: 111 · Issues: 81

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stargazers: 8233 · Issues: 87 · Issues: 1801
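
A hedged sketch of the high-level Python `LLM` API mentioned above, as shipped in recent TensorRT-LLM releases; the model name and sampling fields are illustrative assumptions and may differ between versions.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API; exact fields may vary by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds or loads a TensorRT engine
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```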

gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of Python.

Language: Python · License: BSD-3-Clause · Stargazers: 5529 · Issues: 63 · Issues: 98
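
Not gpt-fast's actual code, but a tiny sketch of the pytorch-native approach it takes: compiling the per-token decode step with `torch.compile(mode="reduce-overhead")`. The linear layer here is a stand-in for the real transformer forward pass.

```python
# Illustrative only: compile a per-token decode step, as gpt-fast does for its transformer.
import torch
import torch.nn as nn

model = nn.Linear(1024, 32000).cuda().half()  # stand-in for a real decoder forward pass

@torch.compile(mode="reduce-overhead")  # CUDA-graph-friendly compilation for decoding
def decode_step(hidden):
    logits = model(hidden)
    return torch.argmax(logits, dim=-1)

next_token = decode_step(torch.randn(1, 1024, device="cuda", dtype=torch.half))
```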

sglang

SGLang is a fast serving framework for large language models and vision language models.

Language: Python · License: Apache-2.0 · Stargazers: 5199 · Issues: 53 · Issues: 524
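
A minimal sketch of SGLang's frontend DSL against a locally launched server; the endpoint URL is a placeholder and assumes the server was started separately.

```python
# Minimal SGLang frontend sketch; assumes a server is already running at the endpoint.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is continuous batching?")
print(state["answer"])
```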

llm-numbers

Numbers every LLM developer should know
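
In the spirit of this list, a back-of-the-envelope KV-cache calculation; the model shape below is an assumed example, not a number taken from the repo.

```python
# Rough KV-cache size per token for an assumed 32-layer model in fp16.
num_layers, num_kv_heads, head_dim, bytes_per_el = 32, 32, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el  # K and V
print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")  # ~0.52 MB
print(f"{4096 * kv_bytes_per_token / 1e9:.2f} GB for a 4096-token context")  # ~2.15 GB
```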

Awesome-LLM-Inference

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.

S-LoRA

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Language: Python · License: Apache-2.0 · Stargazers: 1706 · Issues: 24 · Issues: 38

how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda · License: Apache-2.0 · Stargazers: 1153 · Issues: 16 · Issues: 100

punica

Serving multiple LoRA-finetuned LLMs as one

Language: Python · License: Apache-2.0 · Stargazers: 951 · Issues: 12 · Issues: 38

How_to_optimize_in_GPU

A series of GPU optimization topics that introduces CUDA kernel optimization in detail, covering several basic kernels: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.

Language: Cuda · License: Apache-2.0 · Stargazers: 804 · Issues: 13 · Issues: 15

LLMSys-PaperList

Large Language Model (LLM) Systems Paper List

rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Language: C++ · License: Apache-2.0 · Stargazers: 517 · Issues: 12 · Issues: 86

ByteTransformer

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

Language: C++ · License: Apache-2.0 · Stargazers: 452 · Issues: 10 · Issues: 10

compiler-and-arch

A list of tutorials, papers, talks, and open-source projects on emerging compilers and architectures

glake

GLake: optimizing GPU memory management and IO transmission.

Language: Python · License: Apache-2.0 · Stargazers: 352 · Issues: 7 · Issues: 22

flux

A fast communication-overlapping library for tensor parallelism on GPUs.

Language: C++ · License: Apache-2.0 · Stargazers: 191 · Issues: 7 · Issues: 21

flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Language: Cuda · License: Apache-2.0 · Stargazers: 167 · Issues: 5 · Issues: 5

triton-shared

Shared Middle-Layer for Triton Compilation

Language: MLIR · License: MIT · Stargazers: 160 · Issues: 10 · Issues: 64

TileFlow

TileFlow is a performance analysis tool, based on Timeloop, for fusion dataflows.

Language: C++ · License: MIT · Stargazers: 53 · Issues: 1 · Issues: 0

MAGIS

MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)

Language: Python · License: MIT · Stargazers: 37 · Issues: 2 · Issues: 2

APR

[IJCAI 23] APR: Online Distant Point Cloud Registration Through Aggregated Point Cloud Reconstruction

Language: Python · License: MIT · Stargazers: 10 · Issues: 1 · Issues: 1