Support for Multi-GPU Parallel Training in chargpt.py
JinXiaofeng1234 opened this issue
Hello minGPT Team,
I recently rented a cloud server with 4 NVIDIA RTX 4090 GPUs, aiming to use them to train models with your chargpt.py script. However, I found that the script utilizes only a single GPU, whose 24 GB of memory is insufficient for my training requirements.
Given the potential of multi-GPU training to significantly reduce training time and handle larger models or datasets, I'm interested in modifying chargpt.py to support multi-GPU parallel training. Could you provide guidance or suggestions on how to achieve this? Specifically, I'm looking for advice on integrating PyTorch's DataParallel or DistributedDataParallel functionality into the script.
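Something like the following is what I have in mind for the DataParallel route. This is only a minimal sketch: `TinyModel` is a hypothetical stand-in for the GPT model that chargpt.py constructs, and the wrapping step is the part I would hope to apply to the real `model` object in the script.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the GPT model built in chargpt.py;
# in the real script this would be the `model` returned by GPT(config).
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

model = TinyModel()

if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the module on each visible GPU and
    # splits the batch dimension of the input across the replicas.
    model = nn.DataParallel(model).cuda()

# The forward pass is unchanged from the caller's point of view;
# with multiple GPUs the batch of 4 is sharded across devices.
out = model(torch.randn(4, 8))
print(tuple(out.shape))  # (4, 8)
```

My understanding is that once wrapped, checkpoint saving needs `model.module.state_dict()` rather than `model.state_dict()`, and that DistributedDataParallel (launched via `torchrun`, one process per GPU) would scale better than DataParallel, but I'm unsure how either interacts with the minGPT Trainer.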
I appreciate any help or pointers you can provide. Thank you for your time and for the great work on the minGPT project.
Best regards