Interface with Hugging Face Accelerate for distributed training
rosikand opened this issue
Rohan Sikand commented
Create a new `distributed_train` function in `torchplate.experiment.Experiment` which interfaces with Hugging Face Accelerate for zero-overhead distributed training of PyTorch models. Avoid `.to(device)` placements, as the Accelerate library handles device placement for you. This function should be callable even with a single GPU.
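Rough sketch of what this could look like (the attribute names `self.model`, `self.optimizer`, `self.trainloader`, and the per-batch loss hook `self.evaluate` are assumptions about torchplate internals, not confirmed API):

```python
from accelerate import Accelerator


def distributed_train(self, num_epochs=1):
    """Hypothetical method body for torchplate.experiment.Experiment."""
    accelerator = Accelerator()

    # accelerator.prepare handles device placement and DDP wrapping,
    # so no manual .to(device) calls are needed.
    model, optimizer, trainloader = accelerator.prepare(
        self.model, self.optimizer, self.trainloader
    )

    model.train()
    for epoch in range(num_epochs):
        for batch in trainloader:
            optimizer.zero_grad()
            loss = self.evaluate(batch)  # assumed per-batch loss hook
            accelerator.backward(loss)   # replaces loss.backward()
            optimizer.step()
```

A script using this would be started with `accelerate launch train.py`; on a single GPU it degrades gracefully to ordinary training.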
Rohan Sikand commented
Optional parameters:
- `split_batches=True`: whether the dataloader's batch size is treated as the true (total) batch size split across devices, or as the per-device batch size (see the sketch below).
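This maps directly onto the `split_batches` argument of `Accelerator` (how torchplate would plumb it through is an assumption):

```python
from accelerate import Accelerator

# With split_batches=True, the batch size set on the dataloader is the
# total batch size, divided across processes. With False (the default),
# each process sees the full batch size, so the effective batch size is
# batch_size * num_processes.
accelerator = Accelerator(split_batches=True)
```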
Rohan Sikand commented
Note: I think you will have to make heavy edits to get it to interface with the metrics properly (see this function). The same is true for model serialization.
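For both pain points, Accelerate has existing utilities that the edits could build on. A sketch (the function name `gather_and_save` and its arguments are placeholders, continuing from the training-loop sketch above):

```python
def gather_and_save(accelerator, model, preds, labels, path="model.pt"):
    # Metrics: gather each process's shard of predictions/labels before
    # computing metrics; gather_for_metrics also drops the samples that
    # Accelerate duplicates to pad the final batch.
    all_preds, all_labels = accelerator.gather_for_metrics((preds, labels))

    # Serialization: unwrap the DDP-wrapped model and save from the main
    # process only.
    accelerator.wait_for_everyone()
    unwrapped = accelerator.unwrap_model(model)
    if accelerator.is_main_process:
        accelerator.save(unwrapped.state_dict(), path)
    return all_preds, all_labels
```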