Uranus's starred repositories
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
ThunderKittens
Tile primitives for speedy kernels
MInference
[NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate, dynamic sparse attention computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
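The core idea behind dynamic sparse attention can be sketched in a few lines. This is a hypothetical illustration, not MInference's actual kernels or block patterns: per query, only the highest-scoring KV blocks are kept before the softmax. (A real implementation estimates block importance cheaply rather than computing the full score matrix first, which is where the speedup comes from.)

```python
import numpy as np

def topk_block_sparse_attention(q, k, v, block=4, keep=2):
    """Dynamic block-sparse attention sketch: for each query row, keep only
    the `keep` KV blocks with the highest mean score; mask out the rest."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # full scores, for clarity only
    nb = n // block
    # Per-(query, block) importance: mean score within each KV block.
    blk_scores = scores.reshape(n, nb, block).mean(axis=2)
    top = np.argsort(blk_scores, axis=1)[:, -keep:]   # kept block indices per query
    mask = np.full((n, nb), -np.inf)
    np.put_along_axis(mask, top, 0.0, axis=1)
    # Expand the per-block mask to per-column and drop non-selected blocks.
    scores = scores + np.repeat(mask, block, axis=1)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

With `keep` equal to the number of blocks this reduces to dense attention, which makes it easy to sanity-check the masking logic.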
ringattention
Transformers with Arbitrarily Large Context
ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch
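The mechanism both ring-attention repos implement can be simulated on a single process. A minimal sketch (NumPy stand-in, not either repo's actual API): each "host" keeps its query block, KV blocks rotate one hop per step around the ring, and partial softmaxes are merged online so the result matches full attention.

```python
import numpy as np

def ring_attention(q, k, v, n_hosts=4):
    """Single-process simulation of ring attention: Q blocks stay put,
    KV blocks circulate; online-softmax statistics merge the partials."""
    seq, d = q.shape
    blk = seq // n_hosts
    q_blocks = [q[i*blk:(i+1)*blk] for i in range(n_hosts)]
    kv_blocks = [(k[i*blk:(i+1)*blk], v[i*blk:(i+1)*blk]) for i in range(n_hosts)]
    outs = []
    for h in range(n_hosts):
        qi = q_blocks[h]
        m = np.full((blk, 1), -np.inf)   # running row max
        l = np.zeros((blk, 1))           # running softmax denominator
        acc = np.zeros((blk, d))         # running weighted-value numerator
        for step in range(n_hosts):
            # KV block that has rotated to host h at this step.
            kj, vj = kv_blocks[(h + step) % n_hosts]
            s = qi @ kj.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)    # rescale old stats to the new max
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ vj
            m = m_new
        outs.append(acc / l)
    return np.vstack(outs)
```

Because only one KV block is resident per host at a time, memory per device stays constant as the ring (and hence the context) grows.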
AI-Software-Startups
A survey of AI startups
ttt-lm-jax
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
LLM-Viewer
Analyzes the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
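The roofline model mentioned above reduces to one formula: execution time is the max of compute time and memory-traffic time. A minimal sketch (hypothetical helper, not LLM-Viewer's actual API; peak numbers are A100-like assumptions):

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Simple roofline estimate: time = max(flops/peak_flops, bytes/peak_bw).
    Returns (time_s, bound), where bound names the limiting resource."""
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw
    return max(t_compute, t_memory), ("compute" if t_compute >= t_memory else "memory")

# Example: single-token decode GEMV against a 4096x4096 fp16 weight matrix.
flops = 2 * 4096 * 4096        # one multiply-add per weight
bytes_moved = 2 * 4096 * 4096  # fp16 weight traffic dominates
t, bound = roofline_time(flops, bytes_moved, peak_flops=312e12, peak_bw=2.0e12)
# Arithmetic intensity is ~1 FLOP/byte, far below the device's ridge point,
# so decode is memory-bound -- the usual motivation for KV-cache and weight quantization.
```

This is the style of per-layer analysis such tools automate across a whole model.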
MatmulTutorial
An easy-to-understand TensorOp matmul tutorial
sarathi-serve
A low-latency & high-throughput serving engine for LLMs
vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
SwiftTransformer
High-performance Transformer implementation in C++.
Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk