Xu-Chen's starred repositories

flute

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Language: Cuda | License: Apache-2.0 | Stars: 18

flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda | License: Apache-2.0 | Stars: 821

Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Language: Cuda | Stars: 93

TLLM_QMM

TLLM_QMM strips the quantized-kernel implementation out of NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.

Language: C++ | License: Apache-2.0 | Stars: 9

MobileLLM

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024).

Language: Python | License: NOASSERTION | Stars: 784

GPTModels.nvim

GPTModels - a multi-model, window-based LLM AI plugin for Neovim, with an emphasis on stability and clean code

Language: Lua | License: MIT | Stars: 34

GPTQModel

An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization).

Language: Python | License: Apache-2.0 | Stars: 27
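To make "weight-only quantization" concrete, here is a minimal NumPy sketch of the storage format GPTQ-based toolkits produce: INT4 weights plus one floating-point scale per group of columns. It uses naive round-to-nearest, not the GPTQ algorithm itself (which adds second-order error correction); all names are illustrative, not from any of these libraries.

```python
import numpy as np

def quantize_dequantize(W, bits=4, group_size=128):
    """Round-to-nearest weight-only quantization with per-group scales.

    Illustrates the INT4 + per-group-scale format only; real GPTQ
    chooses the rounding using second-order (Hessian) information.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    groups = W.reshape(-1, group_size)              # one scale per group
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    W_dq = (q * scales).reshape(W.shape)            # dequantized weights
    return W_dq, q.astype(np.int8)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W_dq, q = quantize_dequantize(W)
err = np.abs(W - W_dq).max()                        # bounded by scale / 2
```

The per-element error is bounded by half a quantization step, which is why larger groups (coarser scales) trade accuracy for a smaller memory footprint.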

QQQ

QQQ is an innovative and hardware-optimized W4A8 quantization solution.

Language: Python | Stars: 31

ocular

AI-powered search and chat for orgs: think ChatGPT meets Google Search, but powered by your data.

Language: TypeScript | License: NOASSERTION | Stars: 431

mistral-inference

Official inference library for Mistral models

Language: Jupyter Notebook | License: Apache-2.0 | Stars: 9279

farfalle

🔍 AI search engine - self-host with local or cloud LLMs

Language: TypeScript | License: Apache-2.0 | Stars: 2323

EAGLE

Official Implementation of EAGLE-1 and EAGLE-2

Language: Python | License: Apache-2.0 | Stars: 671

aspoem

Learn Chinese poetry with AsPoem.com

Language: TypeScript | License: AGPL-3.0 | Stars: 2415

DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

License: MIT | Stars: 3075

Perplexica

Perplexica is an AI-powered search engine and an open-source alternative to Perplexity AI.

Language: TypeScript | License: MIT | Stars: 11235

marlin

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Language: Python | License: Apache-2.0 | Stars: 463
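To unpack the "FP16xINT4" idea, here is a pure-NumPy sketch of what such a kernel computes logically: weights are stored two 4-bit values per byte, then unpacked, scaled, and multiplied by FP16 activations. A fused GPU kernel like Marlin does the dequantization and matmul in a single pass so the memory traffic stays at INT4 size; the packing layout and function names below are illustrative (not Marlin's actual format), and the arithmetic is done in float32 here for simplicity.

```python
import numpy as np

def pack_int4(q):
    # pack two signed 4-bit values per byte (even column -> high nibble)
    u = (q.astype(np.int16) + 8).astype(np.uint8)    # shift to [0, 15]
    return (u[..., 0::2] << 4) | u[..., 1::2]

def unpack_int4(packed):
    # inverse of pack_int4: recover signed values in [-8, 7]
    hi = (packed >> 4).astype(np.int16) - 8
    lo = (packed & 0x0F).astype(np.int16) - 8
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), np.int16)
    out[..., 0::2] = hi
    out[..., 1::2] = lo
    return out

def fp16_int4_matmul(x, packed_w, scales):
    # logically what a fused FP16xINT4 kernel computes: dequantize the
    # INT4 weights with per-output-column scales, then matmul
    w = unpack_int4(packed_w).astype(np.float32) * scales
    return x.astype(np.float32) @ w

rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=(64, 64))               # fake INT4 weights
scales = np.full(64, 0.01, dtype=np.float32)         # one scale per column
x = rng.standard_normal((4, 64)).astype(np.float32)  # activations
y = fp16_int4_matmul(x, pack_int4(q), scales)
```

Because LLM decoding at small batch sizes is bound by reading the weight matrix, halving (or quartering) its bytes is where the ~4x speedup comes from.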

ollama

Get up and running with Llama 3, Mistral, Gemma 2, and other large language models.

Language: Go | License: MIT | Stars: 79365

clarity-ai

A simple Perplexity AI clone.

Language: TypeScript | License: MIT | Stars: 1112

aphrodite-engine

PygmalionAI's large-scale inference engine

Language: Python | License: AGPL-3.0 | Stars: 802

exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Language: Python | License: MIT | Stars: 3275

AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization, with a 2x speedup during inference.

Language: Python | License: MIT | Stars: 1459

nm-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python | License: NOASSERTION | Stars: 241

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Language: Python | License: MIT | Stars: 4135

Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

Language: Shell | Stars: 6370

gpt-fast

Simple and efficient PyTorch-native transformer text generation in under 1000 lines of Python.

Language: Python | License: BSD-3-Clause | Stars: 5372

openai-scala-client

Scala client for OpenAI API

Language: Scala | License: MIT | Stars: 169