Tom-CaoZH

Zhang Cao's starred repositories

long-context-attention

Sequence Parallel Attention for Long Context LLM Model Training and Inference

Language: Python · 16000

ThunderKittens

Tile primitives for speedy kernels

Language: Cuda · License: MIT · 115400

nvbandwidth

A tool for bandwidth measurements on NVIDIA GPUs.

Language: C++ · License: Apache-2.0 · 22300

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · 26600

MAGIS

MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)

Language: Python · License: MIT · 2000

DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

License: MIT · 235700

touying

Touying is a powerful package for creating presentation slides in Typst.

Language: Typst · License: MIT · 26400

Honeycomb

Component-Model Framework in C++

Language: C++ · License: NOASSERTION · 4400

foyer

Hybrid memory and disk cache in Rust

Language: Rust · License: Apache-2.0 · 4500

nimble

New file format for storage of large columnar datasets.

Language: C++ · License: Apache-2.0 · 34500

SnapKV

Language: Python · 12700

Sequoia

Scalable and robust tree-based speculative decoding algorithm

Language: Python · 26500

TriForce

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Language: Python · 12700

LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

Language: Python · 39600
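
The core of speculative decoding is the verification step. A cheap draft model proposes tokens, and the target model accepts each one with probability min(1, p(x)/q(x)). Here p is the target distribution and q is the draft distribution. At the first rejection, a replacement token is resampled from the normalized residual max(0, p − q). The sketch below illustrates only that acceptance rule. It assumes both models' per-position distributions are already available as plain Python lists. The function `speculative_accept` and its argument layout are hypothetical and are not taken from the LLMSpeculativeSampling repository.

```python
import random
from typing import List, Sequence


def speculative_accept(
    draft_tokens: Sequence[int],
    q_probs: Sequence[Sequence[float]],
    p_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Hypothetical verification step of speculative sampling.

    q_probs[i]: draft-model distribution at draft position i.
    p_probs[i]: target-model distribution at draft position i; p_probs has one
    extra final row used to sample a bonus token if every draft is accepted.
    """
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]  # q > 0 since tok was drawn from q
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual distribution norm(max(0, p - q))
        # and stop; later draft tokens are discarded.
        residual = [max(0.0, pp - qq) for pp, qq in zip(p_probs[i], q_probs[i])]
        out.append(rng.choices(range(len(residual)), weights=residual)[0])
        return out
    # Every draft token was accepted: sample one extra token from the target model.
    out.append(rng.choices(range(len(p_probs[-1])), weights=p_probs[-1])[0])
    return out


rng = random.Random(0)
# Draft and target agree exactly, so both drafts are accepted; the bonus row
# puts all mass on token 0.
print(speculative_accept([0, 1], [[0.5, 0.5]] * 2, [[0.5, 0.5]] * 2 + [[1.0, 0.0]], rng))  # [0, 1, 0]
```

This rule preserves the target model's output distribution. That property is why the technique is described as lossless. The speedup comes from the target model scoring all draft positions in one forward pass, which this sketch leaves out.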

SAS-Cache

[MSST '24] SAS-Cache: A Semantic-Aware Secondary Cache for LSM-based Key-Value Stores

License: MIT · 300

prophet-rocksdb

[MSST '24] Prophet: Optimizing LSM-Based Key-Value Store on ZNS SSDs with File Lifetime Prediction and Compaction Compensation.

Language: C++ · License: GPL-2.0 · 400

DeepCache

[CVPR 2024] DeepCache: Accelerating Diffusion Models for Free

Language: Python · License: Apache-2.0 · 63600

llama3

The official Meta Llama 3 GitHub site

Language: Python · License: NOASSERTION · 2137400

streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Language: Python · License: MIT · 628400
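
The attention-sink idea behind StreamingLLM keeps the key/value entries of a few initial tokens (the "sinks") in the KV cache. It also keeps a sliding window of the most recent tokens and evicts everything in between. The sketch below computes which token positions such a policy would retain. The function name `streaming_kv_keep` is hypothetical, and the defaults (4 sinks, a 1020-token window) are illustrative assumptions, not settings read from the streaming-llm repository.

```python
from typing import List


def streaming_kv_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> List[int]:
    """Hypothetical eviction policy: return the positions whose KV entries stay cached.

    Keeps the first `num_sinks` positions (attention sinks) plus the most recent
    `window` positions; positions in between are evicted.
    """
    if seq_len <= num_sinks + window:
        return list(range(seq_len))  # everything still fits in the cache
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


print(streaming_kv_keep(10, num_sinks=2, window=3))  # [0, 1, 7, 8, 9]
```

The cache size stays fixed at `num_sinks + window` however long the stream grows. The paper also assigns positional information by position within the cache rather than in the original text, and that detail is omitted here.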

calm

CUDA/Metal accelerated language model inference

Language: C · License: MIT · 31800

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda · License: MIT · 2007200

CacheGen

Language: Python · 1600

fastmoe

A fast MoE impl for PyTorch

Language: Python · License: Apache-2.0 · 142100

lut-gemm

Language: C++ · License: Apache-2.0 · 1400

llamafile

Distribute and run LLMs with a single file.

Language: C++ · License: NOASSERTION · 1610300

MiniCPM

MiniCPM-2B: An end-side LLM that outperforms Llama2-13B.

Language: Jupyter Notebook · License: Apache-2.0 · 409500

libbf

🎯 Bloom filters for C++11

Language: C++ · License: BSD-3-Clause · 35200
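
A Bloom filter stores set membership in a bit array. It sets k bit positions per inserted item and answers "possibly present" only if all k bits for a query are set. It can therefore give false positives but never false negatives. libbf itself is a C++ library, and the Python sketch below does not reproduce its API. The class `BloomFilter` is a hypothetical illustration. It assumes k indexes derived by double hashing from a single SHA-256 digest.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Hypothetical minimal Bloom filter (not the libbf API)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array, all zeros

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit indexes from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("apple")
print(bf.might_contain("apple"))  # True
```

The false-positive rate depends on the bit-array size, the number of hash functions and the number of inserted items. Production libraries expose these as tuning parameters.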

FrozenHot

Language: Python · License: Apache-2.0 · 2500

ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Language: Python · License: Apache-2.0 · 857700

Bamboo

Bamboo-7B Large Language Model

License: Apache-2.0 · 8500