Wei's repositories
AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
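A minimal quantization sketch, assuming the `awq` package from this repo is installed; the model path and output directory are placeholder examples, and the config values follow common AWQ 4-bit settings:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder example model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model, calibrate and quantize the weights to 4-bit, then save.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq-4bit")
tokenizer.save_pretrained("mistral-7b-awq-4bit")
```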
BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
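This is not the BitBLAS API, just a NumPy sketch of the computation such a library fuses (the same FP16xINT4 pattern that marlin's kernel below targets): activations stay FP16 while group-quantized INT4 weights are dequantized on the fly. All names here are illustrative:

```python
import numpy as np

def mixed_precision_matmul(a_fp16, w_int4, scales, group_size=128):
    """Reference semantics for an FP16 x INT4 matmul: dequantize, then GEMM.
    A fused kernel performs the dequantization inside the inner loop instead.

    a_fp16: (M, K) float16 activations
    w_int4: (K, N) int8 array holding values in [-8, 7] (the 4-bit range)
    scales: (K // group_size, N) float16 per-group scales
    """
    K = w_int4.shape[0]
    # Expand per-group scales to per-row scales and dequantize the weights.
    row_scales = np.repeat(scales, group_size, axis=0)[:K]
    w_fp16 = w_int4.astype(np.float16) * row_scales
    return a_fp16 @ w_fp16

# Tiny usage example with random data.
M, K, N, g = 4, 256, 8, 128
a = np.random.randn(M, K).astype(np.float16)
w = np.random.randint(-8, 8, size=(K, N), dtype=np.int8)
s = (np.random.rand(K // g, N) * 0.1).astype(np.float16)
print(mixed_precision_matmul(a, w, s, group_size=g).shape)  # (4, 8)
```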
BitNet
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch
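A minimal sketch of the paper's BitLinear idea, assuming PyTorch; this is not the repo's code, just the binarize-and-rescale forward pass with a straight-through estimator:

```python
import torch
import torch.nn as nn

class BitLinear(nn.Linear):
    """Sketch of a 1-bit linear layer in the spirit of BitNet (not the repo's
    implementation). Weights are binarized to +/-1 around their mean and
    rescaled by their mean absolute value; the straight-through estimator
    lets gradients flow to the latent full-precision weights."""
    def forward(self, x):
        w = self.weight
        beta = w.abs().mean()                 # scaling factor
        w_bin = torch.sign(w - w.mean())      # binarize to +/-1
        # Binary weights in the forward pass, full-precision gradients back.
        w_ste = w + (w_bin * beta - w).detach()
        return nn.functional.linear(x, w_ste, self.bias)

layer = BitLinear(16, 8)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
```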
Book-Mathematical-Foundation-of-Reinforcement-Learning
This is the homepage of a new book entitled "Mathematical Foundations of Reinforcement Learning."
clover
Official Implementation of Clover-1 and Clover-2
cs-self-learning
A guide to self-studying computer science
EAGLE
EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation
EfficientQAT
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
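As a sketch of the generic fake-quantization building block that QAT methods rest on (not EfficientQAT's specific algorithm): round weights to a low-bit grid in the forward pass while gradients pass through unchanged:

```python
import torch

def fake_quantize(w, bits=4):
    """Generic QAT fake-quantization (a sketch, not EfficientQAT itself):
    the forward pass sees weights snapped to a b-bit uniform grid, but the
    straight-through estimator routes gradients to the latent FP weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad.abs().sum() > 0)  # gradients reach the latent FP weights
```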
grub2-bios-uefi-usb
Create a USB boot drive with support for legacy BIOS and 32/64-bit UEFI in a single partition on Linux
marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
matmulfreellm
Implementation of the MatMul-free LM.
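A toy illustration (not the repo's implementation) of why ternary weights remove multiplications: with weights in {-1, 0, +1}, a matrix-vector product reduces to sums and differences of the inputs:

```python
import numpy as np

def ternary_matvec(x, w_ternary):
    """Conceptual sketch: with ternary weights, each output element is just
    (sum of inputs where w == +1) - (sum of inputs where w == -1)."""
    out = np.zeros(w_ternary.shape[1], dtype=x.dtype)
    for j in range(w_ternary.shape[1]):
        out[j] = x[w_ternary[:, j] == 1].sum() - x[w_ternary[:, j] == -1].sum()
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([[1, -1], [0, 1], [-1, 1]])   # ternary weight matrix
print(ternary_matvec(x, w))                # [-2.  4.]
print(x @ w)                               # same result via ordinary matmul
```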
MCSD
Multi-Candidate Speculative Decoding
Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
Sequoia
A scalable and robust tree-based speculative decoding algorithm
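EAGLE, Clover, MCSD, Ouroboros, and Sequoia all elaborate on the same draft-then-verify loop. A minimal greedy sketch with placeholder model functions (a real verifier scores all draft tokens in one batched target pass, and the tree-based variants verify many candidate branches at once):

```python
def speculative_decode(target_next, draft_next, prompt, k=4, steps=32):
    """Minimal greedy speculative decoding (placeholder model functions).
    target_next(seq) -> next token under the large target model
    draft_next(seq)  -> next token under the small draft model
    The draft proposes k tokens; the target keeps the longest verified
    prefix plus one corrected token, so output matches target-only decoding.
    """
    seq = list(prompt)
    for _ in range(steps):
        # 1) Draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Target verifies position by position (sequential here for
        #    clarity; in practice this is one batched forward pass).
        accepted = []
        for i in range(k):
            t = target_next(seq + accepted)
            accepted.append(t)          # always keep the target's token
            if t != draft[i]:
                break                   # mismatch: discard remaining drafts
        seq += accepted
    return seq

# Toy usage: the draft agrees with the target only at even positions.
target = lambda s: (len(s) * 7) % 10
drafty = lambda s: (len(s) * 7) % 10 if len(s) % 2 == 0 else 0
print(speculative_decode(target, drafty, [1, 2, 3], k=4, steps=3))
```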
ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
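A conceptual sketch of multiplication-less arithmetic, not the repo's reparameterization: approximate a multiply by decomposing the weight into a short signed sum of powers of two, so each term becomes a bit-shift:

```python
import math

def shift_add_multiply(x_int, w, terms=2):
    """Approximate x * w using only shifts and adds (a conceptual sketch,
    not ShiftAddLLM's algorithm). The weight is greedily decomposed into a
    signed sum of powers of two; each term is a bit-shift of x."""
    acc, residual = 0, w
    for _ in range(terms):
        if residual == 0:
            break
        sign = 1 if residual > 0 else -1
        exp = round(math.log2(abs(residual)))   # nearest power of two
        acc += sign * (x_int << exp if exp >= 0 else x_int >> -exp)
        residual -= sign * 2 ** exp
    return acc

print(shift_add_multiply(10, 5))   # 10<<2 + 10<<0 = 50 (exact, two terms)
print(shift_add_multiply(10, 6))   # 10<<3 - 10<<1 = 60 (exact, two terms)
```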
SpeculativeDecodingPapers
📰 Must-read papers and blogs on Speculative Decoding ⚡️
surya
OCR, layout analysis, reading order, line detection in 90+ languages
tilelang
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
tiny-gpu
A minimal GPU design in Verilog to learn how GPUs work from the ground up
tinyllama-bitnet
Train your own small BitNet model
tvm_mlir_learn
A collection of compiler learning resources (TVM and MLIR).