whutbd's repositories
cuda-learn-note
🎉 CUDA notes / frequently asked interview questions / C++ notes. Personal notes, updated at my own pace: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Cpp-Templates-2ed
C++11/14/17/20 templates and generic programming: among the most complex and difficult technical details of C++, and indispensable for building infrastructure libraries.
byteps
A high-performance, generic framework for distributed DNN training
ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
cmake-demo
Source code for the book 《CMake入门实战》 (a hands-on introduction to CMake)
CMakeTutorial
A hands-on CMake tutorial in Chinese
core
The core library and APIs implementing the Triton Inference Server.
CTranslate2
Fast inference engine for Transformer models
FasterTransformer
Transformer related optimization, including BERT, GPT
fastllm
A pure-C++ LLM acceleration library for all platforms, callable from Python. A ChatGLM-6B-class model can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
flashinfer
FlashInfer: Kernel Library for LLM Serving
fun-rec
An introductory tutorial on recommender systems; read online at https://datawhalechina.github.io/fun-rec/
graph-learn
An Industrial Graph Neural Network Framework
How_to_optimize_in_GPU
This is a series of GPU optimization topics explaining in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit.
llm.c
LLM training in simple, raw C/CUDA
onnx-modifier
A tool to modify ONNX models in a visualization fashion, based on Netron and Flask.
onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
pytorch-diffusion
A PyTorch reimplementation of Stable Diffusion
pytorch-transformer
A PyTorch reimplementation of the Transformer
PytorchOCR
A PyTorch-based OCR toolkit supporting common text detection and recognition algorithms
rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
SimpleGPUHashTable
A simple GPU hash table implemented in CUDA using lock-free techniques
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs