Longzhi Wang's repositories
act
Run your GitHub Actions locally 🚀
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
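A minimal quantization sketch following AutoGPTQ's documented workflow; the model checkpoint and the one-sentence calibration set are placeholders (real use needs a proper calibration corpus):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates rounding against sample activations; one toy example here.
examples = [tokenizer("auto-gptq is an easy-to-use quantization library.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit, group-wise scales
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```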
Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Camp
Training camp for the PaddlePaddle (飞桨) Escort Program.
DeepCache
[CVPR 2024] DeepCache: Accelerating Diffusion Models for Free
EAGLE
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
flash-attention
Fast and memory-efficient exact attention
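FlashAttention ships as a drop-in kernel; a minimal sketch of the public `flash_attn_func` entry point (shapes and dtypes follow the library's convention, and a CUDA GPU is required):

```python
import torch
from flash_attn import flash_attn_func

# q, k, v are (batch, seqlen, nheads, headdim) in fp16/bf16 on a CUDA device.
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact (not approximate) attention, computed tile-by-tile in on-chip SRAM,
# so the full seqlen x seqlen score matrix is never materialized in HBM.
out = flash_attn_func(q, k, v, causal=True)
```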
gemma.cpp
A lightweight, standalone C++ inference engine for Google's Gemma models.
gemma_pytorch
The official PyTorch implementation of Google's Gemma models
gligen-gui
An intuitive GUI for GLIGEN that uses ComfyUI as its backend
grok-1
Grok open release
KIVI
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
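KIVI's exact layout (keys quantized per-channel, values per-token) is repo-specific, but the asymmetric quantization primitive it builds on is standard; a generic sketch with a hypothetical `asym_quantize` helper:

```python
import torch

def asym_quantize(x: torch.Tensor, bits: int = 2):
    """Generic asymmetric quantization: map [min, max] onto {0, ..., 2^bits - 1}.

    Illustrates the primitive only; KIVI's per-channel/per-token grouping
    and residual full-precision window are not shown here.
    """
    qmax = 2**bits - 1
    xmin, xmax = x.min(), x.max()
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = torch.round((x - xmin) / scale).clamp(0, qmax).to(torch.uint8)
    return q, scale, xmin

def asym_dequantize(q, scale, zero_point):
    return q.float() * scale + zero_point
```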
KVQuant
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
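The activation-aware idea is to rescale salient weight channels before quantization so their rounding error shrinks; a toy illustration with a hypothetical `awq_style_rescale` helper (the fixed `alpha` exponent stands in for AWQ's actual scale search):

```python
import torch

def awq_style_rescale(weight: torch.Tensor, act_absmean: torch.Tensor, alpha: float = 0.5):
    """Toy sketch of activation-aware scaling, not AWQ's implementation.

    Input channels that see large activations are scaled up before
    quantization; the inverse scale is folded into the preceding op so the
    network's output is mathematically unchanged.
    """
    s = act_absmean.clamp(min=1e-5) ** alpha  # per-input-channel scale
    w_scaled = weight * s                     # quantize this tensor instead
    return w_scaled, 1.0 / s                  # fold 1/s into the previous layer
```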
Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
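Medusa's tree attention and typical-acceptance scheme are repo-specific, but the draft-and-verify step shared by speculative decoding methods can be sketched; `accept_draft` is a hypothetical helper assuming greedy decoding, with the target model's logits already computed over the drafted positions:

```python
import torch

def accept_draft(target_logits: torch.Tensor, draft_tokens: torch.Tensor):
    """Greedy draft verification (generic speculative decoding, not Medusa's
    tree variant): keep the longest prefix of drafted tokens the target model
    would itself have produced, plus one free token from the target.

    target_logits: (k + 1, vocab) logits at the k drafted positions and one beyond.
    draft_tokens:  (k,) tokens proposed by the cheap decoding heads.
    """
    preds = target_logits.argmax(dim=-1)                    # target's greedy choices
    match = draft_tokens == preds[: draft_tokens.numel()]
    n = int(match.cumprod(dim=0).sum())                     # accepted prefix length
    return torch.cat([draft_tokens[:n], preds[n : n + 1]])  # accepted + 1 bonus token
```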
MegEngine
MegEngine is a fast, scalable, easy-to-use deep learning framework with automatic differentiation.
ml_dtypes
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
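The extension dtypes plug directly into NumPy; a minimal sketch:

```python
import numpy as np
import ml_dtypes

# bfloat16 behaves like any other NumPy dtype.
x = np.array([0.1, 0.5, 1.5], dtype=ml_dtypes.bfloat16)
print(x.dtype)  # bfloat16

# Round-tripping shows the precision loss of the 8-bit float formats.
y = np.array([0.1234], dtype=np.float32).astype(ml_dtypes.float8_e4m3fn)
print(y.astype(np.float32))
```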
mlx
MLX: An array framework for Apple silicon
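MLX arrays live in Apple-silicon unified memory and are lazily evaluated; a minimal sketch:

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = mx.random.normal((4, 4))
c = (a @ b).sum()  # builds a computation graph; nothing runs yet
mx.eval(c)         # lazy evaluation: the graph executes here
print(c.item())
```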
OmniQuant
[ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Open-Sora-Plan
This project aims to reproduce Sora (OpenAI's text-to-video model); we hope the open-source community will contribute to it.
Paddle
PArallel Distributed Deep LEarning: a machine learning framework from industrial practice (the core framework of PaddlePaddle『飞桨』: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
PaddleNLP
👑 An easy-to-use and powerful NLP and LLM library with a 🤗 awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 text classification, 🔍 neural search, ❓ question answering, ℹ️ information extraction, 📄 document intelligence, 💌 sentiment analysis, etc.
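PaddleNLP's `Taskflow` API wraps many of these tasks behind a single call; a minimal sketch using the documented sentiment-analysis preset (pretrained weights download on first use):

```python
from paddlenlp import Taskflow

# One-line access to a pretrained pipeline.
senta = Taskflow("sentiment_analysis")
print(senta("这家餐厅的菜品味道很不错"))  # "The food at this restaurant is quite good."
```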
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
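Recent releases also expose a high-level Python `LLM` API on top of the engine-building workflow; a minimal sketch, with the checkpoint name as a placeholder:

```python
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the model, then runs inference on it.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint
outputs = llm.generate(
    ["The key advantage of TensorRT engines is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```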
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
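FP8 execution is opted into per region via an autocast context, following the library's quickstart pattern; a minimal sketch (requires a Hopper- or Ada-generation GPU):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(16, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the matmul runs in FP8; master weights stay higher precision
```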
triton
Development repository for the Triton language and compiler
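A minimal Triton kernel in the style of the project's first tutorial, a blocked vector add, to show the programming model:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```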
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
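vLLM's offline entry point is a two-class API; a minimal sketch, with the checkpoint name as a placeholder:

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks, so many requests
# can be continuously batched without fragmenting GPU memory.
llm = LLM(model="facebook/opt-125m")  # placeholder checkpoint
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```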