Sheng Qin's starred repositories

ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Language: Python · License: Apache-2.0 · Stargazers: 252

flux

Official inference repo for FLUX.1 models

Language: Python · License: Apache-2.0 · Stargazers: 5211
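
A minimal text-to-image sketch, going through the diffusers integration of FLUX.1 rather than this repo's own inference scripts; the checkpoint id and sampler settings below are illustrative assumptions.

```python
import torch
from diffusers import FluxPipeline

# Load the distilled FLUX.1-schnell checkpoint in bfloat16 and move it to the GPU.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# The schnell variant is tuned for very few steps and no classifier-free guidance.
image = pipe(
    "a photo of a red panda reading a book",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux_sample.png")
```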

evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Language: Python · License: Apache-2.0 · Stargazers: 133

LLM-Viewer

Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.

Language: Python · License: MIT · Stargazers: 245

sglang

SGLang is yet another fast serving framework for large language models and vision language models.

Language: Python · License: Apache-2.0 · Stargazers: 3963
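
A minimal sketch of SGLang's frontend language, assuming a server is already running locally; the endpoint address, prompt, and variable names are placeholders.

```python
import sglang as sgl

# Point the frontend at a locally running SGLang server (address is an assumption).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def answer(s, question):
    # Build a chat-style prompt and generate a short completion named "reply".
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("reply", max_tokens=64))

state = answer.run(question="What is quantization in one sentence?")
print(state["reply"])
```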

LLMRoofline

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Language: Jupyter Notebook · Stargazers: 63
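
A toy roofline calculation (not taken from this repo) showing the bound such comparisons rest on; the hardware peaks are illustrative A100-like numbers.

```python
# Roofline: attainable throughput is capped by either peak compute or
# memory bandwidth times arithmetic intensity. Numbers below are illustrative.
PEAK_TFLOPS = 312.0   # BF16 tensor-core peak, TFLOP/s
PEAK_BW_TBPS = 2.0    # HBM bandwidth, TB/s

def attainable_tflops(intensity_flop_per_byte: float) -> float:
    """min(peak compute, bandwidth * arithmetic intensity), in TFLOP/s."""
    return min(PEAK_TFLOPS, PEAK_BW_TBPS * intensity_flop_per_byte)

# Batch-1 decode on an fp16 LLM is roughly 2 FLOPs per 2-byte weight read,
# i.e. ~1 FLOP/byte, which lands deep in the memory-bound region.
for intensity in (1, 16, 64, 256):
    print(f"intensity {intensity:4d} FLOP/B -> {attainable_tflops(intensity):6.1f} TFLOP/s")
```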

neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Language: Python · License: Apache-2.0 · Stargazers: 2107
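
A hedged sketch of post-training INT8 quantization with the 2.x `quantization.fit` entry point; the toy model and calibration data are placeholders standing in for a real PyTorch model and dataloader.

```python
import torch
from torch.utils.data import DataLoader
from neural_compressor import PostTrainingQuantConfig, quantization

# Placeholder FP32 model and calibration data; both are assumptions for this sketch.
fp32_model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
calib_loader = DataLoader([(torch.randn(128), 0) for _ in range(8)], batch_size=2)

# Default post-training static quantization configuration (INT8).
conf = PostTrainingQuantConfig()
q_model = quantization.fit(model=fp32_model, conf=conf, calib_dataloader=calib_loader)
```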

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Language: Python · License: MIT · Stargazers: 4206
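
A minimal 4-bit GPTQ quantization sketch; the model id and the single calibration sentence are placeholders chosen only to keep the example small.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights with group size 128 is the common GPTQ configuration.
quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)

# GPTQ calibrates layer by layer on a handful of tokenized examples.
examples = [tokenizer("GPTQ calibrates per-layer on sample text.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```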

GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ

Language: Python · License: Apache-2.0 · Stargazers: 2962

bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Language: Python · License: MIT · Stargazers: 5859
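
A minimal sketch of 4-bit NF4 loading through the transformers integration of bitsandbytes; the model id is a placeholder and the prompt is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-125m"  # placeholder; any causal LM on the Hub works

# NF4 4-bit weights with bfloat16 compute, the usual QLoRA-style setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("k-bit quantization keeps weights in", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```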

AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Language: Python · License: MIT · Stargazers: 1533
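
A minimal AWQ quantization sketch following the package's usual flow; the model path, output directory, and quantization settings are illustrative.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"  # placeholder; AWQ is typically applied to larger LLMs
quant_path = "opt-125m-awq"

# Common AWQ settings: 4-bit weights, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```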

llm-numbers

Numbers every LLM developer should know

Stargazers: 4023
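
One of the numbers in question, worked as a quick back-of-the-envelope script; the model shape is an assumed Llama-2-7B-like configuration without grouped-query attention.

```python
# KV-cache size = 2 (K and V) * layers * heads * head_dim * seq_len * batch * bytes/elem.
n_layers, n_kv_heads, head_dim = 32, 32, 128   # illustrative 7B-class shape
seq_len, batch, bytes_per_elem = 4096, 1, 2    # fp16/bf16 cache

kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_cache_bytes / 2**30:.2f} GiB")  # 2.00 GiB for this configuration
```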

composable_kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

Language: C++ · License: NOASSERTION · Stargazers: 268

ThunderKittens

Tile primitives for speedy kernels

Language: Cuda · License: MIT · Stargazers: 1433

xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

Language: Python · License: Apache-2.0 · Stargazers: 198

Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

Stargazers: 959

long-context-attention

Sequence Parallel Attention for Long-Context LLM Training and Inference

Language: Python · Stargazers: 246

minisora

MiniSora: a community project that aims to explore the implementation path and future development direction of Sora.

Language: Python · License: Apache-2.0 · Stargazers: 1135

Latte

Latte: Latent Diffusion Transformer for Video Generation.

Language: Python · License: Apache-2.0 · Stargazers: 1563

Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's text-to-video model); we hope the open-source community will contribute to it.

Language: Python · License: MIT · Stargazers: 11088

PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Language: Python · License: AGPL-3.0 · Stargazers: 1534

QPyTorch

Low Precision Arithmetic Simulation in PyTorch

Language: Python · License: MIT · Stargazers: 258
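
A minimal sketch of simulating a low-precision float format with QPyTorch; the chosen exponent/mantissa widths and rounding mode are illustrative.

```python
import torch
from qtorch.quant import float_quantize

x = torch.randn(4, 4)

# Simulate an FP8-like format (5 exponent bits, 2 mantissa bits) with
# round-to-nearest; the tensor stays fp32 but only representable values remain.
x_lp = float_quantize(x, exp=5, man=2, rounding="nearest")
print((x - x_lp).abs().max())
```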

brevitas

Brevitas: neural network quantization in PyTorch

Language: Python · License: NOASSERTION · Stargazers: 1133
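
A minimal quantization-aware module sketch with Brevitas layers; the layer sizes and bit widths are illustrative choices.

```python
import torch
import torch.nn as nn
from brevitas.nn import QuantLinear, QuantReLU

# Tiny QAT-style stack: 4-bit weights in the linear layers and a quantized
# activation in between. All widths here are illustrative assumptions.
model = nn.Sequential(
    QuantLinear(128, 64, bias=True, weight_bit_width=4),
    QuantReLU(bit_width=4),
    QuantLinear(64, 10, bias=True, weight_bit_width=4),
)

out = model(torch.randn(2, 128))
print(out.shape)
```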

fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Language: Cuda · License: Apache-2.0 · Stargazers: 164

cccl

CUDA Core Compute Libraries

Language: C++ · License: NOASSERTION · Stargazers: 1046

Zelda64Recomp

Static recompilation of Majora's Mask (and soon Ocarina of Time) for PC (Windows/Linux)

Language: C · License: GPL-3.0 · Stargazers: 5148

N64Recomp

Tool to statically recompile N64 games into native executables

Language: C++ · License: MIT · Stargazers: 6214