LiuXinyu's starred repositories
TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
tensorrt_backend
The Triton backend for TensorRT.
pytorch_backend
The Triton backend for PyTorch TorchScript models.
DeepSeek-V2
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
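The key idea behind linear attention is reordering the attention product: instead of the O(N²d) computation (QKᵀ)V, compute Q(KᵀV) in O(Nd²). A minimal pure-Python sketch of that reordering (illustrative only; it omits the feature map and normalization, and real implementations such as those in this repo use fused PyTorch/Triton kernels):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def attention_naive(Q, K, V):
    # Quadratic order: (Q K^T) V -- materializes the N x N score matrix.
    return matmul(matmul(Q, transpose(K)), V)

def attention_linear(Q, K, V):
    # Linear order: Q (K^T V) -- the d x d state K^T V is formed first,
    # so cost grows linearly in sequence length N.
    return matmul(Q, matmul(transpose(K), V))
```

Both orderings are algebraically identical; the speedup comes purely from the associativity of matrix multiplication.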
flash-attention
Fast and memory-efficient exact attention
Programming_Massively_Parallel_Processors
Code and notes for the six major CUDA parallel computing patterns
tensorrtllm_backend
The Triton TensorRT-LLM Backend
pytest-benchmark
pytest fixture for benchmarking code
PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
mistral-inference
Official inference library for Mistral models
database-system-readings
:yum: A curated reading list about database systems
magic-animate
[CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model