rosikand / torchplate

🍽 A minimal and simple experiment module for machine learning research workflows in PyTorch.

Home Page: https://rosikand.github.io/torchplate/


Interface with Hugging Face Accelerate for distributed training

rosikand opened this issue

Create a new `distributed_train` function in `torchplate.experiment.Experiment` that interfaces with Hugging Face Accelerate for zero-overhead distributed training of PyTorch models. Avoid explicit `.to(device)` placements, since Accelerate handles device placement itself. The function should be callable even with a single GPU. A sketch follows the parameter list below.

Optional parameters:

  • `split_batches=True`: whether Accelerate should split each batch yielded by the dataloader across processes (keeping the true, script-level batch size) rather than letting every process draw its own full batch (an effective batch size of batch size × number of processes)
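
A minimal sketch of how this could look, assuming the `Experiment` instance exposes `self.model`, `self.optimizer`, `self.trainloader`, and a per-batch loss method `self.evaluate(batch)` (hypothetical attribute names, not confirmed torchplate internals):

```python
import torch
from accelerate import Accelerator


def distributed_train(self, num_epochs, split_batches=True):
    """Train with Hugging Face Accelerate handling device placement."""
    accelerator = Accelerator(split_batches=split_batches)

    # prepare() wraps the model/optimizer/dataloader for the current setup
    # (single GPU, multi-GPU, TPU, ...); no manual .to(device) calls needed.
    # Reassigning to self ensures self.evaluate() sees the prepared model.
    self.model, self.optimizer, self.trainloader = accelerator.prepare(
        self.model, self.optimizer, self.trainloader
    )

    self.model.train()
    for epoch in range(num_epochs):
        for batch in self.trainloader:
            self.optimizer.zero_grad()
            loss = self.evaluate(batch)   # forward pass + loss (assumed API)
            accelerator.backward(loss)    # replaces loss.backward()
            self.optimizer.step()

    # Serialization: unwrap the DDP wrapper and save from one process only.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(self.model)
    if accelerator.is_main_process:
        torch.save(unwrapped_model.state_dict(), "model.pt")
```

Since Accelerate degrades gracefully to a single GPU or CPU, this same code path covers the one-device case mentioned above.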

Note: I think you will have to make heavy edits to get it to interface with the metrics properly (see this function). The same is true for model serialization.
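
On the metrics point, the core change is that each process only sees its shard of the data, so per-batch values need to be gathered before aggregation. A rough sketch using Accelerate's `gather_for_metrics` (the classification setup and names here are illustrative, not torchplate's actual metrics interface):

```python
import torch
from accelerate import Accelerator


def distributed_accuracy(accelerator: Accelerator, model, dataloader):
    # Assumes `model` and `dataloader` were already passed through
    # accelerator.prepare(). gather_for_metrics collects tensors from every
    # process and trims the samples Accelerate duplicates to pad the final
    # uneven batch, so the global metric comes out exact.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in dataloader:
            preds = model(inputs).argmax(dim=-1)
            preds, labels = accelerator.gather_for_metrics((preds, labels))
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```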