There are 31 repositories under the distributed-training topic.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
The largest collection of PyTorch image encoders / backbones, including training, eval, inference, and export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
Easy-to-use and powerful LLM and SLM library with an awesome model zoo.
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem).
Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Fast and flexible AutoML with learning guarantees.
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Training and serving large-scale neural networks with auto parallelization.
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteer machines across the world.
DLRover: An Automatic Distributed Deep Learning System
Library for Fast and Flexible Human Pose Estimation
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is an incubation project hosted by the LF AI & Data Foundation.
Efficient Deep Learning Systems course materials (HSE, YSDA)
Best practices for training LLaMA models in Megatron-LM
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
Best practices & guides on how to write distributed PyTorch training code
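For orientation, here is a minimal sketch of the pattern such guides typically start from: a torch.distributed process group, a DistributedSampler to shard data, and a DistributedDataParallel wrapper. The toy model, data, and hyperparameters are placeholders rather than code from any repo above, and a torchrun launch is assumed.

```python
# Minimal DistributedDataParallel loop, launched with e.g.:
#   torchrun --nproc_per_node=2 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group("nccl")  # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR
    device = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(32, 4).to(device), device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 4, (1024,)))
    sampler = DistributedSampler(data)  # shards the dataset across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()  # gradients are all-reduced across ranks here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```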
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
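As a rough illustration of the paper's core move (not the official ReLoRA code), the sketch below trains a low-rank delta B @ A on top of a frozen base weight and periodically merges it in and restarts the factors, so the accumulated update can grow beyond rank r. The ReLoRALinear class and its dimensions are hypothetical.

```python
# Hedged sketch of the ReLoRA idea: merge low-rank updates repeatedly
# so the total update becomes high-rank. Not the official implementation.
import math
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):  # hypothetical illustration class
    def __init__(self, in_f, out_f, r=8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_f, in_f), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.A = nn.Parameter(torch.zeros(r, in_f))
        self.B = nn.Parameter(torch.zeros(out_f, r))
        # B starts at zero, so the initial delta B @ A is zero and
        # training begins from the base weight W.
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        return x @ (self.weight + self.B @ self.A).t()

    @torch.no_grad()
    def merge_and_reinit(self):
        # Fold the learned low-rank update into the frozen base weight...
        self.weight += self.B @ self.A
        # ...then restart the factors. The paper additionally resets
        # optimizer state and re-warms the learning rate at each restart.
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)
```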
LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
A full pipeline AutoML tool for tabular data
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
A JAX-based library for building transformers; includes implementations of GPT, Gemma, LLaMA, Mixtral, Whisper, Swin, ViT, and more.
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and an SDPA implementation of Flash Attention v2.
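A hedged sketch of how those two native features can combine: FSDP shards parameters, gradients, and optimizer state across ranks, while F.scaled_dot_product_attention can dispatch to a Flash Attention v2 kernel on supported GPUs. The TinySelfAttention block and all sizes are invented for illustration, and a torchrun/NCCL launch is assumed.

```python
# FSDP wrapping plus SDPA attention, both native PyTorch features.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class TinySelfAttention(nn.Module):  # hypothetical toy block
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        # SDPA picks an efficient backend (Flash Attention v2 where supported).
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(b, t, d))

dist.init_process_group("nccl")  # assumes a torchrun launch
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(TinySelfAttention().cuda())  # shards params/grads/optim state
loss = model(torch.randn(2, 128, 256, device="cuda")).mean()
loss.backward()
```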
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
How to use Cross-Replica / Synchronized BatchNorm in PyTorch
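A minimal sketch of the standard route in current PyTorch, assuming an NCCL process group launched with torchrun; the toy convolutional model is a placeholder. convert_sync_batchnorm replaces every BatchNorm*d layer with SyncBatchNorm so batch statistics are computed across all replicas rather than per GPU.

```python
# Convert BatchNorm layers to SyncBatchNorm, then wrap with DDP.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # assumes a torchrun launch
device = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(device)

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),  # computes per-GPU statistics by default
    torch.nn.ReLU(),
)
# Replace every BatchNorm*d with SyncBatchNorm (cross-replica statistics).
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
model = DDP(model, device_ids=[device])

out = model(torch.randn(8, 3, 32, 32, device=device))  # stats synced across ranks
```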