rosikand / torchplate

🍽 A minimal and simple experiment module for machine learning research workflows in PyTorch.

Home Page: https://rosikand.github.io/torchplate/


Interface with Hugging Face Accelerate for distributed training

rosikand opened this issue

Create a new `distributed_train` function in `torchplate.experiment.Experiment` that interfaces with Hugging Face Accelerate for zero-overhead distributed training of PyTorch models. Avoid explicit `.to(device)` placements, since Accelerate handles device placement itself. The function should be callable even with a single GPU. A sketch follows the parameter list below.

Optional parameters:

  • `split_batches=True`: whether Accelerate should split each batch yielded by the dataloader across processes (keeping the true, script-level batch size) rather than letting every process draw its own full batch (an effective batch size of batch size × number of processes)
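
A minimal sketch of how this could look, assuming the `Experiment` instance exposes `self.model`, `self.optimizer`, `self.trainloader`, and a per-batch loss method `self.evaluate(batch)` (hypothetical attribute names, not confirmed torchplate internals):

```python
import torch
from accelerate import Accelerator


def distributed_train(self, num_epochs, split_batches=True):
    """Train with Hugging Face Accelerate handling device placement."""
    accelerator = Accelerator(split_batches=split_batches)

    # prepare() wraps the model/optimizer/dataloader for the current setup
    # (single GPU, multi-GPU, TPU, ...); no manual .to(device) calls needed.
    # Reassigning to self ensures self.evaluate() sees the prepared model.
    self.model, self.optimizer, self.trainloader = accelerator.prepare(
        self.model, self.optimizer, self.trainloader
    )

    self.model.train()
    for epoch in range(num_epochs):
        for batch in self.trainloader:
            self.optimizer.zero_grad()
            loss = self.evaluate(batch)   # forward pass + loss (assumed API)
            accelerator.backward(loss)    # replaces loss.backward()
            self.optimizer.step()

    # Serialization: unwrap the DDP wrapper and save from one process only.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(self.model)
    if accelerator.is_main_process:
        torch.save(unwrapped_model.state_dict(), "model.pt")
```

Since Accelerate degrades gracefully to a single GPU or CPU, this same code path covers the one-device case mentioned above.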

Note: I think you will have to make heavy edits to get it to interface with the metrics properly (see this function). The same is true for model serialization.
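
On the metrics point, the core change is that each process only sees its shard of the data, so per-batch values need to be gathered before aggregation. A rough sketch using Accelerate's `gather_for_metrics` (the classification setup and names here are illustrative, not torchplate's actual metrics interface):

```python
import torch
from accelerate import Accelerator


def distributed_accuracy(accelerator: Accelerator, model, dataloader):
    # Assumes `model` and `dataloader` were already passed through
    # accelerator.prepare(). gather_for_metrics collects tensors from every
    # process and trims the samples Accelerate duplicates to pad the final
    # uneven batch, so the global metric comes out exact.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in dataloader:
            preds = model(inputs).argmax(dim=-1)
            preds, labels = accelerator.gather_for_metrics((preds, labels))
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```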