ZZK's repositories
Awesome-GPU
Awesome resources for GPUs
cmake-examples
Useful CMake Examples
oneflow
OneFlow is a performance-centered and open-source deep learning framework.
AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Awesome-System-for-Machine-Learning
A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
CacheLib
Pluggable in-process caching engine to build and scale high-performance services
Cpp-Concurrency-in-Action-2ed
C++11/14/17/20 multithreading, covering operating-system principles and concurrent programming techniques.
CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
CV-CUDA
CV-CUDA™ is an open-source, graphics processing unit (GPU)-accelerated library for cloud-scale image processing and computer vision.
data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
DI-engine
OpenDILab Decision AI Engine
fast_io
Significantly faster input/output for C++20
FasterTransformer
Transformer-related optimizations, including BERT and GPT
FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
FBTT-Embedding
This is a Tensor Train based compression library for compressing the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing systems. We showed that this library can reduce the total model size of Facebook's open-sourced DLRM model by up to 100x while achieving the same model quality, and our implementation is faster than the state-of-the-art implementations. Existing state-of-the-art libraries also decompress whole embedding tables on the fly, so they provide no memory reduction during training; our library decompresses only the requested rows and can therefore reduce the memory footprint per embedding table by up to 10,000x. The library also includes a software cache that stores a portion of the table entries in decompressed format for faster lookup and processing.
free-programming-books
:books: Freely available programming books
GPT2
An implementation of training for GPT-2 that supports TPUs
matxscript
A framework for model pre-processing and post-processing
MNN
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
QSync
Official repository for "QSync: Adaptive Mixed-Precision for Training Synchronization".
taichi-hackathon-akinasan
The Akinasan team's (秋名山车队) codebase for the 0th Taichi Hackathon.
TensorRT
TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators.
YHs_Sample
Yinghan's Code Sample