more detailed explanation of Multi GPU
hafezmg48 opened this issue
I was wondering if it would be possible to add more explanation to the multi-GPU section of the README.md. Specifically, a brief explanation of which functions and parts of inference and training are parallelized across multiple GPUs, and to what extent.
From my basic understanding, inference runs entirely on a single GPU, with all forward passes on that one device. In the training phase, batches of data are split across multiple GPUs and the resulting losses and gradients are summed up. Could you please clarify a little? Thanks a lot.
I think what you mentioned is correct.
We have distributed data parallel (DDP) implemented.
The data loader is designed to shard the dataset across processes without any overlap.
After each training step, MPI+NCCL are used to average the gradients and losses and distribute them across all processes.
Each process needs its own GPU; a DDP variant where multiple processes share a single GPU is not implemented.
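For intuition, here is a minimal sketch of those two mechanisms, disjoint batch sharding per process and gradient averaging after each step. It is written with plain MPI on host floats; the helper names are illustrative and the real implementation uses NCCL on device buffers and its own data loader.

```c
// Illustrative sketch only: plain MPI on host floats. The helper names
// (shard_batch_offset, allreduce_mean) are NOT the repo's actual API.
#include <mpi.h>
#include <stddef.h>

// Each process reads a disjoint slice of the token stream: process `rank`
// takes batches at positions rank, rank + world_size, rank + 2*world_size, ...
static size_t shard_batch_offset(size_t step, int rank, int world_size,
                                 size_t tokens_per_batch) {
    return (step * (size_t)world_size + (size_t)rank) * tokens_per_batch;
}

// After backward, sum the gradients over all processes and divide by the
// process count, so every rank ends up with the same averaged gradients.
static void allreduce_mean(float *grads, size_t n, MPI_Comm comm) {
    int world_size;
    MPI_Comm_size(comm, &world_size);
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)n, MPI_FLOAT, MPI_SUM, comm);
    for (size_t i = 0; i < n; i++) grads[i] /= (float)world_size;
}
```

The same averaging is applied to the scalar loss, so every rank reports the same value.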
Multi-GPU inference is not implemented yet; only training is supported. Let's hope we will see Tensor and Pipeline Parallel implementations for inference in the near future 😁. Let's stay updated on @karpathy's roadmap.
Absolutely! @karpathy is doing a lot of great stuff and I have learned a lot here. Training is distributed between the processes as separate batches, which will be more than enough for speedups in the training phase. But for inference parallelism, I was hoping for something similar to what https://github.com/ggerganov/llama.cpp implements as --tensor-split in the example/main program.
Adding to what @chinthysl has said, we now also support ZeRO stage 1, where we shard the optimizer states: each device updates only its own shard of the parameters using the corresponding shard of the gradients, and the full parameters are then gathered across devices.
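For intuition, here is a minimal sketch of that update step, again with plain MPI on host floats and with plain SGD standing in for the actual optimizer; the function name `zero1_step` and the assumption that the parameter count divides evenly by the number of processes are mine, not the repo's.

```c
// Sketch of ZeRO stage 1: each rank keeps optimizer state only for its own
// contiguous shard of parameters, updates that shard, then all ranks gather
// the updated shards so everyone holds the full parameter vector again.
// Plain SGD stands in for the real optimizer; names are illustrative.
#include <mpi.h>
#include <stddef.h>

static void zero1_step(float *params, const float *avg_grads, size_t n_params,
                       float lr, int rank, int world_size, MPI_Comm comm) {
    size_t shard = n_params / (size_t)world_size;  // assume it divides evenly
    size_t lo = (size_t)rank * shard;

    // Update only the local shard; only this shard's optimizer state lives here.
    for (size_t i = 0; i < shard; i++) {
        params[lo + i] -= lr * avg_grads[lo + i];
    }

    // Gather every rank's updated shard so the full parameters are replicated.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  params, (int)shard, MPI_FLOAT, comm);
}
```

Compared with plain DDP, sharding the optimizer state cuts its per-GPU memory roughly by the number of processes, at the cost of the extra gather after the update.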