Zhuobin Huang's starred repositories
Self-Hosting-Guide
Self-Hosting Guide. Learn all about locally hosting (on-premises and private web servers) and managing software applications yourself or within your organization, including Cloud, LLMs, WireGuard, Automation, Home Assistant, and Networking.
generative-recommenders
Repository hosting code used to reproduce results in "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
cudaparsers
Parsers for CUDA binary files
Awesome-LLM-Strawberry
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 and reasoning techniques.
quiet-star
Code for Quiet-STaR
k8s-dra-driver
Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
sgl-learning-materials
Materials for learning SGLang
gpumembench
A GPU benchmark suite for assessing on-chip GPU memory bandwidth
Liger-Kernel
Efficient Triton Kernels for LLM Training
rocm_bandwidth_test
Bandwidth test for ROCm
Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. This is a list of papers on LLM acceleration, currently focused mainly on inference acceleration; related works will be added gradually. Contributions welcome!
torchdynamo
A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
MIT-6.5940
All homework assignments for MIT 6.5940: TinyML and Efficient Deep Learning Computing, Fall 2023 • https://efficientml.ai
MInference
To speed up long-context LLM inference, MInference computes attention with approximate and dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.