distributed-training

There are 17 repositories under distributed-training topic.

Made-With-ML
GokuMohandas / Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Language:Jupyter Notebook 44206
huggingface / pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
augmix convnext distributed-training efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training optimizer pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models
Language:Python 35704
PaddlePaddle / Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice （『飞桨』核心框架，深度学习&机器学习高性能单机、分布式训练和跨平台部署）
deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability
Language:C++ 23375
PaddleNLP
PaddlePaddle / PaddleNLP
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Language:Python 12836
metaflow
Netflix / metaflow
Build, Manage and Deploy AI/ML Systems
agents ai aws azure cost-optimization datascience distributed-training gcp generative-ai high-performance-computing kubernetes llm llmops machine-learning ml ml-infrastructure ml-platform mlops model-management python
Language:Python 9610
skypilot-org / skypilot
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem).
cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu
Language:Python 8919
Fengshenbang-LM
IDEA-CCNL / Fengshenbang-LM
Fengshenbang-LM(封神榜大模型)是IDEA研究院认知计算与自然语言研究中心主导的大模型开源体系，成为中文AIGC和认知智能的基础设施。
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Language:Python 4150
FedML-AI / FedML
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training
Language:Python 3964
bytedance / byteps
A high performance and generic framework for distributed DNN training
deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow
Language:Python 3708
tensorflow / adanet
Fast and flexible AutoML with learning guarantees.
automl tensorflow learning-theory deep-learning neural-architecture-search gpu machine-learning ensemble tpu python distributed-training
Language:Jupyter Notebook 3459
determined
determined-ai / determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow
Language:Go 3193
alpa-projects / alpa
Training and serving large-scale neural networks with auto parallelization.
alpa auto-parallelization compiler deep-learning distributed-computing distributed-training high-performance-computing jax llm machine-learning
Language:Python 3162
learning-at-home / hivemind
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
deep-learning pytorch volunteer-computing mixture-of-experts distributed-training distributed-systems asynchronous-programming asyncio dht hivemind neural-networks machine-learning
Language:Python 2279
intelligent-machine-learning / dlrover
DLRover: An Automatic Distributed Deep Learning System
distributed-training hacktoberfest k8s llm-training
Language:Python 1584
pytorch / gloo
Collective communications library with various primitives for multi-machine training.
collectives distributed-training pytorch
Language:C++ 1365
tensorlayer / HyperPose
Library for Fast and Flexible Human Pose Estimation
tensorlayer tensorflow openpose pose-estimation computer-vision distributed-training tensorrt mobilenet neural-networks
Language:Python 1262
DeepRec-AI / DeepRec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine
Language:C++ 1144
mryab / efficient-dl-systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
cuda deep-learning distributed-training efficient-deep-learning machine-learning ml-infrastructure mlops pytorch
Language:Jupyter Notebook 905
alibaba / Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
deepspeed distributed-training llama llm megatron-lm pretraining pytorch
Language:Python 659
sail-sg / oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
alignment dpo llm llm-aligment rlhf thompson-sampling online-alignment dueling-bandits distributed-training distributed-rl llm-exploration online-rl reasoning grpo ppo r1-zero
Language:Python 561
LambdaLabsML / distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
cuda deepspeed distributed-training gpu gpu-cluster kuberentes nccl pytorch slurm cluster fsdp lambdalabs mpi sharding
Language:Python 530
Guitaricet / relora
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
deep-learning distributed-training llama nlp peft transformer
Language:Jupyter Notebook 467
adaptdl
petuum / adaptdl
Resource-adaptive cluster scheduler for deep learning training.
deep-learning kubernetes pytorch distributed-systems aws distributed-training machine-learning cloud
Language:Python 448
Oneflow-Inc / libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
oneflow nlp deep-learning large-scale data-parallelism model-parallelism distributed-training pipeline-parallelism transformer self-supervised-learning vision-transformer
Language:Python 407
meta-pytorch / torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
airflow aws-batch components deep-learning distributed-training kubernetes machine-learning pipelines python pytorch ray slurm
Language:Python 399
aws-samples / awsome-distributed-training
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
aws awsbatch distributed-training efa eks generative-ai gpu hyperpod llm-training parallelcluster
Language:Shell 365
DataCanvasIO / HyperGBM
A full pipeline AutoML tool for tabular data
adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost
Language:Python 358
PKU-DAIR / Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art
Language:Python 322
maudzung / YOLO3D-YOLOv4-PyTorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4
Language:Python 306
HandyRL
DeNA / HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning
Language:Python 300
lsds / KungFu
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
tensorflow keras distributed-training distributed-systems
Language:Go 298
nanodl
HenryNdubuaku / nanodl
A Jax-based library for building transformers, includes implementations of GPT, Gemma, LlaMa, Mixtral, Whisper, SWin, ViT and more.
attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer
Language:Python 296
alibaba / EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
deep-learning data-parallelism model-parallelism pipeline-parallelism memory-efficient distributed-training gpu
Language:Python 270
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
distributed-training llm pytorch
Language:Python 268
awslabs / deeplearning-cfn
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
deeplearning tensorflow mxnet ec2-instance aws aws-cloudformation distributed-training deep-learning-ami
Language:Python 252
dougsouza / pytorch-sync-batchnorm-example
How to use Cross Replica / Synchronized Batchnorm in Pytorch
pytorch batchnorm distributed-training dataparallel
247