pengcuo's repositories
onnx-simplifier
Simplify your onnx model
Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
ChatPaper
Use ChatGPT to summarize the arXiv papers.
ColossalAI
Making big AI models cheaper, easier, and scalable
dbg-macro
A dbg(…) macro for C++
excelPanel
A two-dimensional RecyclerView for Android that can load both historical and future data.
FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
flash-attention
Fast and memory-efficient exact attention
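The "memory-efficient" part of flash-attention comes from the online (streaming) softmax trick: scores are processed block-by-block with a running max and running denominator, so the full attention row is never materialized. A minimal single-query NumPy sketch (function names are illustrative, not the library's API):

```python
import numpy as np

def attention_online(q, K, V, block=4):
    """Single-query attention computed block-by-block over the keys,
    using the online-softmax rescaling that FlashAttention builds on.
    Illustrative sketch only; the real kernel tiles queries too and
    runs fused on the GPU."""
    m = -np.inf                                   # running max of scores
    s = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1])                    # running weighted sum of values
    for start in range(0, K.shape[0], block):
        scores = K[start:start + block] @ q       # scores for this key block
        m_new = max(m, scores.max())
        corr = np.exp(m - m_new)                  # rescale old state to new max
        w = np.exp(scores - m_new)
        s = s * corr + w.sum()
        acc = acc * corr + w @ V[start:start + block]
        m = m_new
    return acc / s

def attention_naive(q, K, V):
    """Reference: ordinary softmax attention in one shot."""
    scores = K @ q
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V
```

Both functions produce the same output; the streaming version just never needs all the scores at once.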
Llama-Chinese
Llama Chinese community. Llama 3 online demos and fine-tuned models are now available, and the latest Llama 3 learning resources are collected in real time; all code has been updated for Llama 3. Building the best Chinese Llama LLM, fully open source and commercially usable.
llama-recipes
Scripts for fine-tuning Llama 2 with composable FSDP and PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and question answering, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Llama 2 for WhatsApp and Messenger.
llm_interview_note
Notes on the knowledge and interview questions relevant to large language model (LLM) algorithm/application engineers.
LoRA
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
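The core LoRA idea can be sketched in a few lines of NumPy: the frozen weight W is augmented by a trainable low-rank update B @ A, scaled by alpha / r. The function below is a hypothetical illustration of the math, not loralib's API:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA forward pass sketch. W (d_out, d_in) stays frozen; only
    A (r, d_in) and B (d_out, r) are trained, adding just
    r * (d_in + d_out) parameters per layer."""
    r = A.shape[0]
    # Equivalent to x @ (W + (alpha / r) * B @ A).T, but computed
    # without ever forming the merged weight.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

At deployment the update can be merged back into W, so LoRA adds no inference-time overhead.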
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
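The INT4 weight format such a kernel dequantizes on the fly can be sketched as symmetric per-tensor quantization (a simplification; Marlin itself uses grouped scales and a packed GPU layout):

```python
import numpy as np

def quantize_int4(w):
    """Symmetric INT4 quantization sketch: map FP weights onto the
    integer range [-8, 7] with a single FP scale. Real FP16xINT4
    kernels use per-group scales and pack two nibbles per byte."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate FP weights; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```

Storing 4-bit weights cuts weight memory (and memory traffic) by ~4x versus FP16, which is where the speedup in memory-bound decoding comes from.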
Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
MQBench
Model Quantization Benchmark
namegpt
Generate unique and creative project names in seconds with AI!
onnx-modifier
A tool for modifying ONNX models visually, based on Netron and Flask.
ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
prajna
A programming language for AI infrastructure.
PyTorch_YOLOv1
A new version of YOLOv1
sglang
SGLang is a fast serving framework for large language models and vision language models.
TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
triton
Development repository for the Triton language and compiler
United-Perception
United Perception
unsloth
Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory