Qingquan Song's starred repositories
RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable), so it combines the best of RNN and transformer: great performance, fast inference, VRAM savings, fast training, "infinite" ctx_len, and free sentence embedding.
Open-Sora-Plan
This project aims to reproduce Sora (OpenAI's T2V model), and we hope the open-source community will contribute to it.
Megatron-LM
Ongoing research training transformer models at scale
lm-evaluation-harness
A framework for few-shot evaluation of language models.
bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
ThunderKittens
Tile primitives for speedy kernels
alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
lovely-tensors
Tensors, for human consumption
resource-stream
CUDA related news and material links
generative-recommenders
Repository hosting code used to reproduce results in "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
ring-flash-attention
Ring attention implementation with flash attention
NeMo-Aligner
Scalable toolkit for efficient model alignment
Awesome-Generative-RecSys
A curated list of Generative Recommender Systems (Paper & Code)
optimizers
For optimization algorithm research and development.
triton-index
Cataloging released Triton kernels.