more detailed explanation of Multi GPU
hafezmg48 opened this issue
I was wondering if it would be possible to add more explanation to the multi-GPU section of the README.md. Specifically, a brief explanation of which functions and parts of inference and training are parallelized across multiple GPUs, and to what extent.
From my basic understanding, inference runs entirely on a single GPU, with all forward passes on that one device. In the training phase, batches of data are split across multiple GPUs and the resulting losses and gradients are summed up. Could you please clarify a little? Thanks a lot.
I think what you mentioned is correct.
We have distributed data parallel (DDP) implemented.
The data loader is designed to shard the dataset across processes without any overlap.
After each training step, MPI+NCCL are used to average the gradients and losses and distribute them across all processes.
Each process needs its own GPU; a DDP variant where multiple processes share a single GPU is not implemented.
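For intuition, here is a minimal sketch of those two mechanisms, disjoint batch sharding per process and gradient averaging after each step. It is written with plain MPI on host floats; the helper names are illustrative and the real implementation uses NCCL on device buffers and its own data loader.

```c
// Illustrative sketch only: plain MPI on host floats. The helper names
// (shard_batch_offset, allreduce_mean) are NOT the repo's actual API.
#include <mpi.h>
#include <stddef.h>

// Each process reads a disjoint slice of the token stream: process `rank`
// takes batches at positions rank, rank + world_size, rank + 2*world_size, ...
static size_t shard_batch_offset(size_t step, int rank, int world_size,
                                 size_t tokens_per_batch) {
    return (step * (size_t)world_size + (size_t)rank) * tokens_per_batch;
}

// After backward, sum the gradients over all processes and divide by the
// process count, so every rank ends up with the same averaged gradients.
static void allreduce_mean(float *grads, size_t n, MPI_Comm comm) {
    int world_size;
    MPI_Comm_size(comm, &world_size);
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)n, MPI_FLOAT, MPI_SUM, comm);
    for (size_t i = 0; i < n; i++) grads[i] /= (float)world_size;
}
```

The same averaging is applied to the scalar loss, so every rank reports the same value.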
Multi-GPU inference is not implemented yet; only training is supported. Let's hope we will see Tensor and Pipeline Parallel implementations for inference in the near future 😁. Let's stay updated on @karpathy's roadmap.
Absolutely! @karpathy is doing a lot of great stuff and I have learned a lot here. Training is distributed between the processes as separate batches, which will be more than enough for speedups in the training phase. But for inference parallelism, I was hoping for something similar to what https://github.com/ggerganov/llama.cpp implements as --tensor-split in the example/main program.
Adding to what @chinthysl has said, we now also support ZeRO stage 1, where we shard the optimizer states: each device updates only its own shard of the parameters using the corresponding shard of the gradients, and the full parameters are then gathered across devices.
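For intuition, here is a minimal sketch of that update step, again with plain MPI on host floats and with plain SGD standing in for the actual optimizer; the function name `zero1_step` and the assumption that the parameter count divides evenly by the number of processes are mine, not the repo's.

```c
// Sketch of ZeRO stage 1: each rank keeps optimizer state only for its own
// contiguous shard of parameters, updates that shard, then all ranks gather
// the updated shards so everyone holds the full parameter vector again.
// Plain SGD stands in for the real optimizer; names are illustrative.
#include <mpi.h>
#include <stddef.h>

static void zero1_step(float *params, const float *avg_grads, size_t n_params,
                       float lr, int rank, int world_size, MPI_Comm comm) {
    size_t shard = n_params / (size_t)world_size;  // assume it divides evenly
    size_t lo = (size_t)rank * shard;

    // Update only the local shard; only this shard's optimizer state lives here.
    for (size_t i = 0; i < shard; i++) {
        params[lo + i] -= lr * avg_grads[lo + i];
    }

    // Gather every rank's updated shard so the full parameters are replicated.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  params, (int)shard, MPI_FLOAT, comm);
}
```

Compared with plain DDP, sharding the optimizer state cuts its per-GPU memory roughly by the number of processes, at the cost of the extra gather after the update.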