Borui Xu (BoruiXu)

Company: Shandong University

Borui Xu's starred repositories

self-llm

An open-source LLM usage guide (《开源大模型食用指南》): quick deployment of open-source large models in a Linux environment; a deployment tutorial better suited for beginners in China.

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 5052

word-GPT-Plus

Word GPT Plus is a Word add-in that integrates the ChatGPT model into Microsoft Word. Both the official API and the web API are supported.

Language: Vue · License: MIT · Stars: 584

Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs).

Language: Python · License: Apache-2.0 · Stars: 17720

Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Language: Cuda · Stars: 195

mediapipe

Cross-platform, customizable ML solutions for live and streaming media.

Language: C++ · License: Apache-2.0 · Stars: 25876

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stars: 276

GEAR

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Language: Python · Stars: 107

calm

CUDA/Metal accelerated language model inference

Language: C · License: MIT · Stars: 326

fastllm

A pure C++ LLM acceleration library for all platforms, callable from Python; ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.

Language: C++ · License: Apache-2.0 · Stars: 3143

flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Language: Cuda · License: Apache-2.0 · Stars: 150

inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

Language: C++ · License: MIT · Stars: 227

Awesome-LLM-Inference

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

License: GPL-3.0 · Stars: 1667

mlc-llm

Universal LLM Deployment Engine with ML Compilation

Language: Python · License: Apache-2.0 · Stars: 17450

mixtral-offloading

Run Mixtral-8x7B models in Colab or on consumer desktops.

Language: Python · License: MIT · Stars: 2261

Edge-MoE

Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts

Language: C++ · Stars: 73

Anima

33B Chinese LLM, DPO QLoRA, 100K context, AirLLM 70B inference on a single 4GB GPU.

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 3396

LLM-Viewer

Analyze the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.

Language: Python · License: MIT · Stars: 195

lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Language: Python · License: Apache-2.0 · Stars: 1938

streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Language: Python · License: MIT · Stars: 6299

intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡

Language: Python · License: Apache-2.0 · Stars: 2003

DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Language: Python · License: Apache-2.0 · Stars: 1707

FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

Language: C++ · License: Apache-2.0 · Stars: 1562

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Language: Python · License: Apache-2.0 · Stars: 2882

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stars: 7075

FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.

Language: Python · License: Apache-2.0 · Stars: 9047

perf-book

The book "Performance Analysis and Tuning on Modern CPUs".

Language: TeX · License: CC0-1.0 · Stars: 1904

llama3

The official Meta Llama 3 GitHub site

Language: Python · License: NOASSERTION · Stars: 21795

Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.

Language: Python · License: Apache-2.0 · Stars: 1392