whutbd's repositories
cuda-learn-note
🎉 CUDA notes / frequently asked interview questions / C++ notes. Personal notes, updated at my own pace: sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Cpp-Templates-2ed
C++11/14/17/20 templates and generic programming: among the most complex and difficult technical details of C++, and indispensable for building infrastructure libraries.
byteps
A high-performance, generic framework for distributed DNN training
ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
cmake-demo
Source code for the book 《CMake入门实战》 (a hands-on introduction to CMake)
CMakeTutorial
A hands-on CMake tutorial in Chinese
core
The core library and APIs implementing the Triton Inference Server.
CTranslate2
Fast inference engine for Transformer models
FasterTransformer
Transformer related optimization, including BERT, GPT
fastllm
A pure-C++ LLM acceleration library for all platforms, callable from Python. A ChatGLM-6B-class model can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
flashinfer
FlashInfer: Kernel Library for LLM Serving
fun-rec
An introductory tutorial on recommender systems; read online at https://datawhalechina.github.io/fun-rec/
graph-learn
An Industrial Graph Neural Network Framework
How_to_optimize_in_GPU
This is a series of GPU optimization topics explaining in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit.
llm.c
LLM training in simple, raw C/CUDA
onnx-modifier
A tool to modify ONNX models in a visualization fashion, based on Netron and Flask.
onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
pytorch-diffusion
A PyTorch reimplementation of Stable Diffusion
pytorch-transformer
A PyTorch reimplementation of the Transformer
PytorchOCR
A PyTorch-based OCR toolkit supporting common text detection and recognition algorithms
rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
SimpleGPUHashTable
A simple GPU hash table implemented in CUDA using lock-free techniques
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs