llx's starred repositories
splitwise-sim
LLM serving cluster simulator
vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
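For context, the PagedAttention-style baseline that vAttention replaces manages the KV cache as fixed-size physical blocks indexed through a per-sequence block table. A minimal Python sketch of that baseline (all names hypothetical, not vAttention's code):

```python
# Sketch of a PagedAttention-style block table: the baseline that
# vAttention replaces with dynamic virtual-memory management.
# All class and method names here are hypothetical illustrations.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full
            table.append(self.free_blocks.pop())    # grab a fresh block
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0])  # two blocks allocated for 6 tokens
```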
FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
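The heart of FlashAttention is a blocked pass over K/V with an online softmax, so the full attention matrix is never materialized. A slow but numerically faithful PyTorch sketch of that recurrence (illustrative only, not this repo's code):

```python
import torch

def flash_attention_ref(q, k, v, block=64):
    """Blocked attention with an online softmax; numerically equal to
    softmax(q @ k.T / sqrt(d)) @ v without materializing the full matrix."""
    scale = q.shape[-1] ** -0.5
    n = q.shape[0]
    out = torch.zeros_like(q)
    m = torch.full((n, 1), float("-inf"))  # running row max
    l = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale               # scores for this KV block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)           # block probabilities, rescaled
        alpha = torch.exp(m - m_new)       # correction for old partial sums
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l

q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax(q @ k.T * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention_ref(q, k, v), ref, atol=1e-5)
```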
KuiperLLama
A hands-on implementation of an LLM inference framework, built from scratch
ServerlessLLM
Cost-efficient and fast multi-LLM serving.
Triton-Puzzles
Puzzles for learning Triton
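Puzzles in this style usually start from 1-D block indexing. As a taste, here is the canonical Triton vector-add kernel (the standard tutorial pattern, not a puzzle solution; needs a CUDA GPU to run):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the vectors.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                        # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```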
how-to-optim-algorithm-in-cuda
How to optimize common algorithms in CUDA.
CUDA-Learn-Notes
🎉 CUDA/C++ notes / hand-written CUDA kernels for LLMs / tech blog, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
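When hand-writing kernels like these, a small host-side reference implementation is handy for correctness checks. A PyTorch sketch of two of the listed ops, softmax and rmsnorm, under their usual definitions:

```python
import torch

def softmax_ref(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row max so exp() cannot overflow; this is the same
    # trick a warp/block-reduce kernel implements with two passes.
    x = x - x.max(dim=-1, keepdim=True).values
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # x / sqrt(mean(x^2) + eps) * weight, as used in LLaMA-style models.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * weight

x = torch.randn(4, 512)
assert torch.allclose(softmax_ref(x), torch.softmax(x, dim=-1), atol=1e-6)
print(rmsnorm_ref(x, torch.ones(512)).shape)  # torch.Size([4, 512])
```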
MInference
Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
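The idea in miniature: for each query block, pick the few KV blocks that matter and attend only to those. A toy top-k block-sparse attention in PyTorch (a stand-in illustration, not MInference's actual pattern search):

```python
import torch

def topk_block_sparse_attention(q, k, v, block=16, keep=4):
    """Toy dynamic sparse attention: score KV blocks by mean-pooled
    similarity, keep the top `keep` blocks per query block, mask the rest."""
    n, d = q.shape
    nb = n // block
    # Block-level importance scores from mean-pooled queries and keys.
    qb = q.reshape(nb, block, d).mean(dim=1)           # (nb, d)
    kb = k.reshape(nb, block, d).mean(dim=1)           # (nb, d)
    keep_idx = (qb @ kb.T).topk(keep, dim=-1).indices  # (nb, keep)
    # Expand the block decision into a token-level attention mask.
    block_mask = torch.zeros(nb, nb, dtype=torch.bool)
    block_mask[torch.arange(nb).unsqueeze(1), keep_idx] = True
    mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = (q @ k.T) * d ** -0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(topk_block_sparse_attention(q, k, v).shape)  # torch.Size([256, 64])
```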
googletest
GoogleTest - Google Testing and Mocking Framework
ring-flash-attention
Ring attention implementation with flash attention
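The key ingredient ring attention borrows from flash attention is the per-row log-sum-exp, which lets partial outputs from different K/V shards be merged exactly. A single-process sketch of that merge rule (hypothetical helper names, simulating a 2-device ring):

```python
import torch

def partial_attn(q, k, v):
    """Attention over one K/V shard, returning the output and the
    per-row log-sum-exp needed to merge shards later."""
    s = (q @ k.T) * q.shape[-1] ** -0.5
    lse = torch.logsumexp(s, dim=-1, keepdim=True)
    return torch.exp(s - lse) @ v, lse

def merge(o1, lse1, o2, lse2):
    # Combine two partial results exactly, as if computed over the union.
    lse = torch.logaddexp(lse1, lse2)
    return torch.exp(lse1 - lse) * o1 + torch.exp(lse2 - lse) * o2, lse

q = torch.randn(128, 32)
k, v = torch.randn(256, 32), torch.randn(256, 32)
# Simulate a 2-device ring: each "device" holds half of K and V.
o_a, lse_a = partial_attn(q, k[:128], v[:128])
o_b, lse_b = partial_attn(q, k[128:], v[128:])
o, _ = merge(o_a, lse_a, o_b, lse_b)
ref = torch.softmax(q @ k.T * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(o, ref, atol=1e-5)
```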
AI-Software-Startups
A Survey of AI startups
ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
Awesome-RoadMaps-and-Interviews
Awesome interviews for coders: programming languages, software engineering, web, backend, distributed infrastructure, data science & AI | interview essentials
Nsight-Compute-Docker-Image
Nsight Compute in Docker