qinyang-bao

Q7bao's starred repositories

gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Language: Python | License: Apache-2.0 | 184500

AQLM

Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf

Language: Python | License: Apache-2.0 | 109100

unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Language: Python | License: MIT | 1939700

cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Language: Cuda | License: MIT | 25400

GPTQ-triton

GPTQ inference Triton kernel

Language: Jupyter Notebook | License: Apache-2.0 | 27200

x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers

Language: Python | License: MIT | 451200

nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Language: Python | License: MIT | 3581100

WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Language: Python | 916500

FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.

Language: Python | License: Apache-2.0 | 910800

LLM-QAT

Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"

Language: Python | License: NOASSERTION | 22500

examples

Fast and flexible reference benchmarks

Language: Shell | License: Apache-2.0 | 43200

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ | License: Apache-2.0 | 797600