Qubitium-ModelCloud's repositories
alpaca-lora
Instruct-tune LLaMA on consumer hardware
flash-attention
Fast and memory-efficient exact attention
flashinfer
FlashInfer: Kernel Library for LLM Serving
gemma_pytorch
The official PyTorch implementation of Google's Gemma models
lm-format-enforcer
Enforce the output format (JSON Schema, Regex, etc.) of a language model
sglang
SGLang is a structured generation language for large language models (LLMs). It makes interactions with models faster and more controllable.
accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
auto-round
SOTA Weight-only Quantization Algorithm for LLMs
AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization, with a 2x speedup during inference.
BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
clod-code
rot13 version of claw code
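For readers unfamiliar with the transform the description refers to: rot13 is a letter substitution that shifts each letter 13 places, so applying it twice returns the original text. A minimal stdlib sketch (this illustrates rot13 itself, not anything about the repo's contents):

```python
import codecs

def rot13(text: str) -> str:
    # rot13 shifts each letter 13 places in the alphabet;
    # since 13 + 13 = 26, applying it twice is the identity
    return codecs.encode(text, "rot13")

scrambled = rot13("claw code")  # -> "pynj pbqr"
```

Because rot13 is its own inverse, the same function both encodes and decodes.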
evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
FastChat
The release repo for "Vicuna: An Open Chatbot Impressing GPT-4"
GPTQ-for-LLaMa
4 bits quantization of LLaMa using GPTQ
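To make "4-bit quantization" concrete: weights stored as float are mapped to 4-bit integers plus a shared scale. The sketch below shows only plain symmetric round-to-nearest quantization, not GPTQ's Hessian-based error-compensating algorithm; it just illustrates the storage format and round trip:

```python
def quantize_4bit(weights):
    # symmetric round-to-nearest 4-bit quantization
    # (NOT GPTQ's second-order method; illustration only)
    # signed int4 covers -8..7; use the symmetric range +/-7
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # reconstruct approximate float weights from int4 codes
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.7]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
```

GPTQ improves on this baseline by quantizing columns one at a time and adjusting the remaining weights to compensate for the rounding error.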
GPTQ-triton
GPTQ inference Triton kernel
GPTQModel
Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
hqq
Official implementation of Half-Quadratic Quantization (HQQ)
hyperDB
A hyper-fast local vector database for use with LLM Agents. Now accepting SAFEs at $35M cap.
llama.cpp
Port of Facebook's LLaMA model in C/C++
mav
model activation visualiser
pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
qlora
QLoRA: Efficient Finetuning of Quantized LLMs
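QLoRA builds on LoRA: the frozen (quantized) base weight W is augmented with a trainable low-rank update B @ A, so only the small A and B matrices are finetuned. A pure-Python sketch of the rank-1 case (illustrative shapes and values, not QLoRA's actual implementation):

```python
def matmul(A, B):
    # naive matrix multiply for small illustrative matrices
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# frozen 2x2 base weight W plus a rank-1 update B @ A
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[0.5],
     [1.0]]          # 2x1, trainable
A = [[0.2, 0.4]]     # 1x2, trainable

delta = matmul(B, A)  # 2x2 rank-1 update
W_adapted = [[w + d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, delta)]
```

With rank r much smaller than the weight dimensions, A and B hold far fewer parameters than W, which is what makes finetuning cheap; QLoRA additionally keeps W in 4-bit precision.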
QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
the-algorithm
Source code for Twitter's Recommendation Algorithm
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
unsloth
5x faster, 60% less memory QLoRA finetuning
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs