vLLM's repositories
llm-compressor
Transformers-compatible library for applying quantization, sparsity, and other compression algorithms to LLMs for optimized deployment with vLLM (one-shot quantization sketch after this list)
semantic-router
Intelligent router for Mixture-of-Models
production-stack
vLLM’s reference system for K8s-native, cluster-wide deployment with community-driven performance optimization
vllm-ascend
Community-maintained hardware plugin for vLLM on Huawei Ascend
compressed-tensors
A safetensors extension for efficiently storing sparse and quantized tensors on disk (save/load sketch after this list)
tpu-inference
TPU inference for vLLM, with unified JAX and PyTorch support
flash-attention
Fast and memory-efficient exact attention (usage sketch after this list)
speculators
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM (speculative-decoding sketch after this list)
vllm-spyre
Community-maintained hardware plugin for vLLM on IBM Spyre
vllm-gaudi
Community-maintained hardware plugin for vLLM on Intel Gaudi
vllm-neuron
Community-maintained hardware plugin for vLLM on AWS Neuron
vllm-xpu-kernels
Custom XPU kernels for running vLLM on Intel GPUs
DeepGEMM
Clean and efficient FP8 GEMM kernels with fine-grained scaling
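
The entries above are terse, so a few usage sketches follow. For llm-compressor, a one-shot post-training quantization run might look like the sketch below. The model name, dataset, scheme, and output path are illustrative choices, and import paths have moved between releases, so treat this as the shape of the workflow rather than the exact current API.

    # Hedged sketch of llm-compressor's one-shot quantization flow.
    # Import paths and kwargs have shifted across releases; the model,
    # dataset, scheme, and output_dir below are illustrative choices.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    oneshot(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any HF causal LM
        dataset="open_platypus",                     # calibration set
        recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
        output_dir="TinyLlama-1.1B-W4A16",           # point vLLM at this path
    )

The saved directory can then be served directly, e.g. vllm serve TinyLlama-1.1B-W4A16.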
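
For compressed-tensors, the round trip is a save/load pair. The helper names below (save_compressed, load_compressed, BitmaskConfig) follow the project's README at the time of writing and should be read as an assumption about the current API.

    # Hedged sketch: save a mostly-zero tensor with bitmask compression,
    # then load it back. Helper names are assumptions based on the README.
    import torch
    from compressed_tensors import BitmaskConfig, load_compressed, save_compressed

    tensors = {"w": torch.tensor([[0.0, 0.0, 1.0], [0.0, 2.0, 0.0]])}  # sparse

    save_compressed(tensors, "model.safetensors",
                    compression_format=BitmaskConfig().format)
    restored = load_compressed("model.safetensors",
                               compression_config=BitmaskConfig())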
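
For flash-attention, the functional interface takes (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on a CUDA device; the shapes below are arbitrary.

    # Exact attention without materializing the seqlen x seqlen score matrix.
    # Requires a CUDA device and fp16/bf16 inputs.
    import torch
    from flash_attn import flash_attn_func

    b, s, h, d = 2, 1024, 8, 64
    q = torch.randn(b, s, h, d, dtype=torch.float16, device="cuda")
    k, v = torch.randn_like(q), torch.randn_like(q)

    out = flash_attn_func(q, k, v, causal=True)  # shape (2, 1024, 8, 64)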
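
speculators packages draft models for vLLM's speculative decoding. As an illustration of the consuming side (vLLM's speculative_config, not the speculators API itself), an n-gram drafter can be enabled as below; the config keys follow recent vLLM docs and may change between versions.

    # Hedged sketch: speculative decoding in vLLM with an n-gram drafter.
    # speculative_config keys follow recent vLLM docs; they may change.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative target model
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,  # draft tokens proposed per step
            "prompt_lookup_max": 4,       # longest n-gram matched in the prompt
        },
    )
    print(llm.generate(["vLLM is"], SamplingParams(max_tokens=32))[0].outputs[0].text)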