Wei Huang's starred repositories

InternLM

Official release of the InternLM2.5 base and chat models, with 1M-token context support.

Language: Python · License: Apache-2.0 · Stargazers: 6217 · Issues: 0

arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 413 · Issues: 0

TransformerCompression

Code for transformer compression methods, accompanying the maintainers' publications.

Language: Python · License: MIT · Stargazers: 355 · Issues: 0

FlagAttention

A collection of memory-efficient attention operators implemented in the Triton language.

Language: Python · License: NOASSERTION · Stargazers: 203 · Issues: 0

storm

An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.

Language: Python · License: MIT · Stargazers: 10115 · Issues: 0

RAG-Retrieval

Unified efficient fine-tuning of RAG retrieval models, including embedding, ColBERT, and cross-encoder models.

Language: Python · License: MIT · Stargazers: 424 · Issues: 0

FasterTransformer

Transformer-related optimizations, including BERT and GPT.

Language: C++ · License: Apache-2.0 · Stargazers: 5769 · Issues: 0

opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Language: Python · License: Apache-2.0 · Stargazers: 3722 · Issues: 0

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stargazers: 394 · Issues: 0

calm

CUDA/Metal accelerated language model inference

Language: C · License: MIT · Stargazers: 363 · Issues: 0

dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.

Language: Python · License: Apache-2.0 · Stargazers: 13351 · Issues: 0
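
As a quick illustration of the API, a minimal sketch of building a graph and applying one graph-convolution layer (feature sizes are arbitrary; assumes the PyTorch backend):

    import torch
    import dgl
    from dgl.nn import GraphConv

    # A tiny directed graph with 3 nodes and edges 0->1, 1->2.
    g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))
    g.ndata["feat"] = torch.randn(3, 8)   # 8-dim feature per node

    # One graph convolution; node 0 has no in-edges, hence the flag.
    conv = GraphConv(8, 4, allow_zero_in_degree=True)
    h = conv(g, g.ndata["feat"])          # -> tensor of shape (3, 4)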

Liger-Kernel

Efficient Triton Kernels for LLM Training

Language: Python · License: BSD-2-Clause · Stargazers: 2870 · Issues: 0
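
A minimal usage sketch, following the API shown in the project's README (requires a CUDA GPU; the model name is illustrative):

    from transformers import AutoModelForCausalLM
    from liger_kernel.transformers import apply_liger_kernel_to_llama

    # Monkey-patches Llama's RMSNorm, RoPE, SwiGLU, etc. with Triton kernels;
    # must run before the model is instantiated.
    apply_liger_kernel_to_llama()
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")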

EQ-Bench

A benchmark for emotional intelligence in large language models

Language: Python · License: MIT · Stargazers: 175 · Issues: 0

OmniQuant

[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Language: Python · License: MIT · Stargazers: 668 · Issues: 0

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stargazers: 8146 · Issues: 0
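
A minimal sketch of the high-level Python API described above, modeled on the project's quick-start; the import path and model name are assumptions that may vary across releases:

    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds/loads an engine
    params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(["What does TensorRT-LLM do?"], params):
        print(output.outputs[0].text)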

flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda · License: Apache-2.0 · Stargazers: 1125 · Issues: 0

QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Language: Python · License: Apache-2.0 · Stargazers: 247 · Issues: 0
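
QuaRot's starting point is computational invariance: multiplying a layer's weights by an orthogonal matrix Q and its incoming activations by Q^T leaves the output unchanged, while the rotation spreads activation outliers across channels and makes 4-bit quantization much easier. A minimal sketch of the invariance itself (not the paper's full Hadamard-based pipeline):

    import torch

    d = 64
    W = torch.randn(128, d)                    # linear-layer weight
    x = torch.randn(d)                         # incoming activation

    Q, _ = torch.linalg.qr(torch.randn(d, d))  # random orthogonal matrix

    y_plain = W @ x
    y_rot = (W @ Q) @ (Q.T @ x)                # rotated weights and activations
    print(torch.allclose(y_plain, y_rot, atol=1e-4))  # True: W Q Q^T x == W x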

any-precision-llm

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Language: Python · License: MIT · Stargazers: 69 · Issues: 0

T-MAC

Low-bit LLM inference on CPU with lookup tables.

Language: C++ · License: MIT · Stargazers: 415 · Issues: 0
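
The lookup-table idea: rather than multiplying by each low-bit weight, precompute every possible partial sum of a small group of activations once and index the table with the packed weight bits, so one lookup replaces several multiply-adds. A minimal pure-Python sketch for 1-bit ({-1, +1}) weights in groups of 4 (the real kernels build each table once and reuse it across many weight rows):

    import itertools

    def lut_dot(acts, weight_bits, g=4):
        """Dot product of activations with 1-bit weights (0 -> -1, 1 -> +1)."""
        total = 0.0
        for i in range(0, len(acts), g):
            group = acts[i:i + g]
            # All 2**g signed sums of this activation group, keyed by bit pattern.
            table = {bits: sum(a if b else -a for a, b in zip(group, bits))
                     for bits in itertools.product((0, 1), repeat=g)}
            total += table[tuple(weight_bits[i:i + g])]  # one lookup per group
        return total

    print(lut_dot([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1]))  # 1 - 2 + 3 + 4 = 6.0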

TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library

Language: C++ · License: MIT · Stargazers: 694 · Issues: 0

marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Language: Python · License: Apache-2.0 · Stargazers: 555 · Issues: 0

Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Language: Cuda · Stargazers: 256 · Issues: 0

smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Language: Python · License: MIT · Stargazers: 1174 · Issues: 0
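
SmoothQuant's core step migrates quantization difficulty from activations to weights with a per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha): dividing activations by s and multiplying weights by s leaves the matmul unchanged while flattening activation outliers. A minimal sketch of that equivalence (alpha = 0.5, random data):

    import torch

    alpha = 0.5
    X = torch.randn(16, 64) * 10          # activations with large magnitudes
    W = torch.randn(64, 32)               # weights (in_features x out_features)

    # Per-input-channel smoothing factors, as in the paper.
    s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)

    X_hat = X / s                         # smoother activations, easier to quantize
    W_hat = W * s.unsqueeze(1)            # difficulty migrated into the weights

    print(torch.allclose(X @ W, X_hat @ W_hat, atol=1e-3))  # True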

GPTQ-triton

GPTQ inference Triton kernel

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 272 · Issues: 0

Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

Language: Shell · Stargazers: 7371 · Issues: 0

AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Language: Python · License: MIT · Stargazers: 1621 · Issues: 0
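
A minimal quantization sketch following the API shown in the project's README (the model path and quantization config are illustrative):

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative
    quant_config = {"zero_point": True, "q_group_size": 128,
                    "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
    model.save_quantized("mistral-7b-instruct-awq")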

llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Language: Python · License: MIT · Stargazers: 2325 · Issues: 0