Xiaoyu Zhang's repositories
tvm_mlir_learn
A collection of compiler learning resources.
how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
how-to-learn-deep-learning-framework
How to learn PyTorch and OneFlow.
giantpandacv.com
www.giantpandacv.com
mlc-llm-code-analysis
Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (LLaMA, LLaMA2, ChatGLM2, ChatGPT, Claude, etc.) on 50+ datasets.
How_to_optimize_in_GPU
This is a series of GPU optimization topics explaining in detail how to optimize CUDA kernels. It covers several basic kernels, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit.
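For context on the elementwise topic mentioned above, a minimal CUDA elementwise kernel (a hedged sketch for illustration, not code from the repository) typically uses a grid-stride loop so one launch covers arrays larger than the grid:

```cuda
// Sketch of a naive elementwise add kernel with a grid-stride loop.
// Names (elementwise_add) are illustrative, not from the repo.
__global__ void elementwise_add(const float* a, const float* b,
                                float* c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];  // each thread handles multiple elements
    }
}
```

Optimization then proceeds by measuring achieved bandwidth against the device's theoretical peak, since elementwise kernels are memory-bound.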
tokenizers-cpp
Universal cross-platform tokenizers binding to HF and sentencepiece
FasterTransformer
Transformer-related optimizations, including BERT and GPT.
LLaMA-Factory
Easy-to-use LLM fine-tuning framework (LLaMA, BLOOM, Mistral, Baichuan, Qwen, ChatGLM)
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
tvm_gpu_gemm
Playing with GEMM in TVM.