ar4 / deepwave

Wave propagation modules for PyTorch.

Distributed (multi-GPU) execution

lnnnn123 opened this issue

Hello, Professor.
I tried to run the elastic wave equation on multiple GPUs, but I encountered some problems. The attachment contains my modified code, but in the results there are two horizontal lines splitting the velocity model, and the gradient is not being updated. Another problem is that the results when I save a file from a single shot and then update the gradient with a second shot are relatively poor. There is also coupling between the multiple parameters, and I would like to understand how that works.

Thanks
多参数问题.pdf (multi-parameter problem)

This is my code; because it is quite long, I converted it into a file. Thank you for your corrections.
elatasic.txt

Thanks for your guidance. After testing, I finally found the reason: I ran the program on four GPUs at the same time; one GPU could run, but it could not propagate the gradient well, which degraded the inversion results. Attached is the result of my latest run; it clearly propagates the gradient well now, but the two lines still appear. I suspect this is because propagating the gradient across multiple GPUs is not as correct as on a single GPU. What do you think?
test.pdf

This is my code; the main change is to run it on GPUs 0, 1 and 3. Your corrections are welcome.
elatasic.txt
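
For reference, restricting DataParallel to a specific subset of GPUs is usually done either with the CUDA_VISIBLE_DEVICES environment variable or through the device_ids argument; a minimal sketch, with prop_module and the script name as placeholders:

# Illustrative only: limit DataParallel to GPUs 0, 1 and 3
prop = torch.nn.DataParallel(prop_module, device_ids=[0, 1, 3])

# Alternatively, hide the other GPUs before launching the script:
#   CUDA_VISIBLE_DEVICES=0,1,3 python elastic_inversion.py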

I think I see the problem. In Deepwave's DataParallel example (https://github.com/ar4/deepwave/blob/master/docs/example_distributed_dp.py) the models are passed to the constructor of the propagator object, not when the propagator is being applied. This is because PyTorch's DataParallel divides input Tensors between GPUs on the specified dimension (0 by default), so if you pass the models when the propagator is applied the models will also be divided among the GPUs on this dimension. By passing them to the constructor instead, you avoid this.

So, I suggest that you modify your Prop class to take vp, vs, and rho as inputs in __init__ rather than passing them when applying the propagator:

import torch
import deepwave
from deepwave import elastic


class Prop(torch.nn.Module):

    def __init__(self, dx, dt, freq, vp, vs, rho):
        super().__init__()

        # Store the models on the module so that DataParallel replicates
        # them to each GPU instead of splitting them along the shot dimension
        self.dx = dx
        self.dt = dt
        self.freq = freq
        self.vp = vp
        self.vs = vs
        self.rho = rho

    def forward(self, source_amplitudes, source_locations, receiver_locations):
        # Only the per-shot Tensors are passed here, so they are the only
        # inputs that DataParallel divides among the GPUs
        out = elastic(
            *deepwave.common.vpvsrho_to_lambmubuoyancy(self.vp, self.vs,
                                                       self.rho),
            self.dx,
            self.dt,
            source_amplitudes_y=source_amplitudes,
            source_locations_y=source_locations,
            receiver_locations_y=receiver_locations,
            pml_freq=self.freq,
        )
        # Return the y-component receiver amplitudes
        return out[-2]
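
For reference, a minimal sketch of how this Prop class might then be wrapped and applied, assuming the shot dimension comes first in the source/receiver Tensors and using loss_fn and observed_data as placeholder names:

# Build the propagator once with the whole models, then wrap it so that
# only the per-shot inputs are split across the GPUs on dim 0 (shots)
prop = Prop(dx, dt, freq, vp, vs, rho)
prop = torch.nn.DataParallel(prop)

# source_amplitudes: [n_shots, n_sources_per_shot, nt]
# source_locations / receiver_locations: [n_shots, n_per_shot, 2]
receiver_amplitudes = prop(source_amplitudes, source_locations,
                           receiver_locations)

loss = loss_fn(receiver_amplitudes, observed_data)
loss.backward()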

I hope that will resolve the problem you encountered, but note that PyTorch's documentation recommends using DistributedDataParallel rather than DataParallel for better performance. Here is an example of using it with Deepwave: https://github.com/ar4/deepwave/blob/master/docs/example_distributed_ddp.py
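
For completeness, here is a rough sketch of that DistributedDataParallel pattern (simplified; the linked script is the authoritative version), assuming vp, vs, and rho are registered as torch.nn.Parameters inside Prop so that DDP averages their gradients across ranks:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run_rank(rank, world_size, dx, dt, freq, vp, vs, rho,
             source_amplitudes, source_locations, receiver_locations):
    # One process per GPU, communicating over NCCL
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # Replicate the propagator (and its model Parameters) on this rank's GPU
    prop = Prop(dx, dt, freq, vp, vs, rho).to(rank)
    prop = DDP(prop, device_ids=[rank])

    # Each rank propagates only its own slice of the shots
    amps = torch.chunk(source_amplitudes, world_size)[rank].to(rank)
    src = torch.chunk(source_locations, world_size)[rank].to(rank)
    rec = torch.chunk(receiver_locations, world_size)[rank].to(rank)
    receiver_amplitudes = prop(amps, src, rec)

    # Compute the loss on this slice and call backward(); DDP then
    # averages the model gradients across all ranks automatically
    # ...

    dist.destroy_process_group()


# Launch one process per GPU:
# mp.spawn(run_rank, args=(world_size, dx, dt, freq, vp, vs, rho,
#                          source_amplitudes, source_locations,
#                          receiver_locations),
#          nprocs=world_size)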

Thank you for your guidance. My multi-GPU program now runs correctly, and the gradient is propagated correctly.

I hope this Issue is resolved, so I am going to close it. Please feel free to reopen it if you have further questions about this.