CUDA Templates for Linear Algebra Subroutines
OneFlow is a performance-centered and open-source deep learning framework.
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
Pluggable in-process caching engine for building and scaling high-performance services
Useful CMake Examples
C++11/14/17/20 multithreading, covering operating-system principles and concurrent-programming techniques.
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
OpenDILab Decision AI Engine
Significantly faster input/output for C++20
Transformer-related optimizations, including BERT and GPT
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
This is a Tensor-Train-based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing systems. We showed that this library can reduce total model size by up to 100x on Facebook's open-sourced DLRM model while achieving the same model quality, and our implementation is faster than state-of-the-art alternatives. Existing state-of-the-art libraries also decompress whole embedding tables on the fly, so they provide no memory reduction during training. Our library decompresses only the requested rows and can therefore reduce the memory footprint per embedding table by up to 10,000x. The library also includes a software cache that stores a portion of the table entries in decompressed format for faster lookup and processing.
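The row-wise decompression idea above can be sketched in a few lines of numpy: store small Tensor-Train cores instead of the full N x D table, and materialize only the requested row by contracting one slice per core. This is a minimal sketch of the technique, not the library's actual API; all shapes, ranks, and function names here are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) factorization: vocabulary N = 8*8*8 = 512,
# embedding dim D = 4*4*4 = 64, TT ranks r = [1, 16, 16, 1].
n = [8, 8, 8]        # row-index factors
m = [4, 4, 4]        # embedding-column factors
r = [1, 16, 16, 1]   # TT ranks

rng = np.random.default_rng(0)
# One 4-D core per factor, shape (r_{k-1}, n_k, m_k, r_k).
cores = [rng.standard_normal((r[k], n[k], m[k], r[k + 1]))
         for k in range(3)]

def tt_row(cores, n, i):
    """Decompress only row i of the implicit N x D embedding table."""
    # Mixed-radix digits of the row index: i = (i1*n2 + i2)*n3 + i3.
    idx = []
    for nk in reversed(n):
        idx.append(i % nk)
        i //= nk
    idx.reverse()
    # Contract the selected core slices left to right.
    v = cores[0][:, idx[0]]                    # (1, m1, r1)
    for k in range(1, len(cores)):
        g = cores[k][:, idx[k]]                # (r_{k-1}, m_k, r_k)
        # (1, cols, r) x (r, m_k, r') -> (1, cols*m_k, r')
        v = np.einsum('abr,rcs->abcs', v, g).reshape(1, -1, g.shape[-1])
    return v.reshape(-1)                       # length-D embedding vector

row = tt_row(cores, n, 123)   # only this row is ever materialized
```

Only the cores (a few thousand floats here) live in memory, while the full 512 x 64 table is never formed; a production version would add the software cache of hot rows mentioned above.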
:books: Freely available programming books
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
Tutorials for writing high-performance GPU operators in AI frameworks.
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
A treasure chest for visual classification and recognition powered by PaddlePaddle
Paddle distributed training examples: ResNet, BERT, GPT, MoE, DataParallel, ModelParallel, PipelineParallel, HybridParallel, AutoParallel, ZeRO Sharding, Recompute, GradientMerge, Offload, AMP, DGC, LocalSGD, Wide&Deep
👑 Easy-to-use and powerful NLP library with a 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, and a 🖼 Diffusion AIGC system.
Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727
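The core trick in low-rank gradient compression of this kind (PowerSGD, per the linked paper) is to approximate a gradient matrix M by a rank-r product P Qᵀ via one power-iteration step, then communicate only the small P and Q factors. Below is a hedged, simplified numpy sketch of that single step; it omits error feedback, warm starts, and the actual all-reduce, and all names and shapes are illustrative assumptions rather than the repository's API.

```python
import numpy as np

rng = np.random.default_rng(1)
grad = rng.standard_normal((256, 128))   # a layer's gradient, viewed as a matrix
rank = 4                                 # target rank (illustrative)

# One power-iteration step: grad ~= p @ q.T with p (256 x 4), q (128 x 4).
q = rng.standard_normal((grad.shape[1], rank))
p = grad @ q                             # (256, rank)
p, _ = np.linalg.qr(p)                   # orthonormalize the left factor
q = grad.T @ p                           # (128, rank)

# In distributed training, workers would all-reduce p and q
# ((256 + 128) * 4 floats) instead of the full 256 * 128 gradient.
approx = p @ q.T                         # low-rank reconstruction
```

Because p is orthonormal, the reconstruction is a projection of the gradient onto a rank-4 subspace, so it never overshoots the true gradient's norm while cutting communication volume by roughly 20x in this configuration.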
TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
Yinghan's Code Sample