karpathy / minGPT

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training


Support for Multi-GPU Parallel Training in chargpt.py

JinXiaofeng1234 opened this issue

Hello minGPT Team,

I recently rented a cloud instance with four NVIDIA RTX 4090 GPUs, intending to use them to train models with your chargpt.py script. However, the script appears to use only a single GPU's memory (24 GB), which is not enough for my training needs.
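For what it's worth, here is a quick way to confirm that all four cards are at least visible to PyTorch (a small sanity check on my side; the values in the comments are what I expect, not captured output):

```python
import torch

# Sanity check: confirm all four GPUs are visible to this PyTorch process.
print(torch.cuda.device_count())              # expected to print 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))   # expected: four RTX 4090 entries
```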

Given the potential of multi-GPU training to significantly reduce training time and to handle larger models or datasets, I'm interested in modifying chargpt.py to support multi-GPU parallel training. Could you provide guidance or suggestions on how to achieve this? Specifically, I'm looking for advice on integrating PyTorch's DataParallel or DistributedDataParallel into the script.
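For reference, this is roughly the change I have been considering for the DataParallel route. It is only a minimal sketch under my own assumptions: I am assuming the GPT model can be wrapped before training starts and that, as in minGPT, its forward pass returns (logits, loss); the helper name wrap_for_multi_gpu is mine, not something in the repo.

```python
import torch
import torch.nn as nn

def wrap_for_multi_gpu(model: nn.Module) -> nn.Module:
    """Replicate the model on every visible GPU with nn.DataParallel.

    Each input batch is split along dim 0, run on the replicas in
    parallel, and the outputs are gathered back onto GPU 0.
    """
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.cuda()

# Hypothetical usage, mirroring how I understand chargpt.py builds the model:
#   model = GPT(config)                 # as in chargpt.py
#   model = wrap_for_multi_gpu(model)
#   logits, loss = model(x, y)          # loss now has one entry per GPU replica
#   loss = loss.mean()                  # reduce before calling backward()
```

If DistributedDataParallel is the better fit, my understanding is that it requires one process per GPU (e.g. launched with torchrun), with each process building the model on its own device, wrapping it in torch.nn.parallel.DistributedDataParallel, and using a DistributedSampler for the data loader. I'd appreciate confirmation of whether either approach fits cleanly with the existing Trainer.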

I appreciate any help or pointers you can provide. Thank you for your time and for the great work on the minGPT project.

Best regards