ar4 / deepwave

Wave propagation modules for PyTorch.

Distributed (multi-GPU) execution

lnnnn123 opened this issue

Hello, Professor.
I tried to run the elastic wave equation on multiple GPUs, but I encountered some problems. The attachment contains my modified code, but in the results there are two horizontal lines splitting the velocity model, and the gradient is not being updated. Another problem is that the results when I save a file from a single shot and then update the gradient with a second shot are relatively poor. There is also coupling between the multiple parameters, and I would like to understand how that works.

Thanks
多参数问题.pdf (multi-parameter problem)

This is my code; because it is quite long, I converted it into a file. Thank you for your corrections.
elatasic.txt

Thanks for your guidance. After testing, I finally found the reason: I ran the program on four GPUs at the same time; one GPU could run, but it could not propagate the gradient well, which degraded the inversion results. Attached is the result of my latest run; it clearly propagates the gradient well now, but the two lines still appear. I suspect this is because propagating the gradient across multiple GPUs is not as correct as on a single GPU. What do you think?
test.pdf

This is my code; the main change is to run it on GPUs 0, 1 and 3. Your corrections are welcome.
elatasic.txt
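
For reference, restricting DataParallel to a specific subset of GPUs is usually done either with the CUDA_VISIBLE_DEVICES environment variable or through the device_ids argument; a minimal sketch, with prop_module and the script name as placeholders:

# Illustrative only: limit DataParallel to GPUs 0, 1 and 3
prop = torch.nn.DataParallel(prop_module, device_ids=[0, 1, 3])

# Alternatively, hide the other GPUs before launching the script:
#   CUDA_VISIBLE_DEVICES=0,1,3 python elastic_inversion.py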

I think I see the problem. In Deepwave's DataParallel example (https://github.com/ar4/deepwave/blob/master/docs/example_distributed_dp.py) the models are passed to the constructor of the propagator object, not when the propagator is being applied. This is because PyTorch's DataParallel divides input Tensors between GPUs on the specified dimension (0 by default), so if you pass the models when the propagator is applied the models will also be divided among the GPUs on this dimension. By passing them to the constructor instead, you avoid this.

So, I suggest that you modify your Prop class to take vp, vs, and rho as inputs in __init__ rather than passing them when applying the propagator:

import torch
import deepwave
from deepwave import elastic


class Prop(torch.nn.Module):

    def __init__(self, dx, dt, freq, vp, vs, rho):
        super().__init__()

        # Store the models on the module so that DataParallel replicates
        # them to each GPU instead of splitting them along the shot dimension
        self.dx = dx
        self.dt = dt
        self.freq = freq
        self.vp = vp
        self.vs = vs
        self.rho = rho

    def forward(self, source_amplitudes, source_locations, receiver_locations):
        # Only the per-shot Tensors are passed here, so they are the only
        # inputs that DataParallel divides among the GPUs
        out = elastic(
            *deepwave.common.vpvsrho_to_lambmubuoyancy(self.vp, self.vs,
                                                       self.rho),
            self.dx,
            self.dt,
            source_amplitudes_y=source_amplitudes,
            source_locations_y=source_locations,
            receiver_locations_y=receiver_locations,
            pml_freq=self.freq,
        )
        # Return the y-component receiver amplitudes
        return out[-2]
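
For reference, a minimal sketch of how this Prop class might then be wrapped and applied, assuming the shot dimension comes first in the source/receiver Tensors and using loss_fn and observed_data as placeholder names:

# Build the propagator once with the whole models, then wrap it so that
# only the per-shot inputs are split across the GPUs on dim 0 (shots)
prop = Prop(dx, dt, freq, vp, vs, rho)
prop = torch.nn.DataParallel(prop)

# source_amplitudes: [n_shots, n_sources_per_shot, nt]
# source_locations / receiver_locations: [n_shots, n_per_shot, 2]
receiver_amplitudes = prop(source_amplitudes, source_locations,
                           receiver_locations)

loss = loss_fn(receiver_amplitudes, observed_data)
loss.backward()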

I hope that will resolve the problem you encountered, but note that PyTorch's documentation recommends using DistributedDataParallel rather than DataParallel for better performance. Here is an example of using it with Deepwave: https://github.com/ar4/deepwave/blob/master/docs/example_distributed_ddp.py
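
For completeness, here is a rough sketch of that DistributedDataParallel pattern (simplified; the linked script is the authoritative version), assuming vp, vs, and rho are registered as torch.nn.Parameters inside Prop so that DDP averages their gradients across ranks:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run_rank(rank, world_size, dx, dt, freq, vp, vs, rho,
             source_amplitudes, source_locations, receiver_locations):
    # One process per GPU, communicating over NCCL
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # Replicate the propagator (and its model Parameters) on this rank's GPU
    prop = Prop(dx, dt, freq, vp, vs, rho).to(rank)
    prop = DDP(prop, device_ids=[rank])

    # Each rank propagates only its own slice of the shots
    amps = torch.chunk(source_amplitudes, world_size)[rank].to(rank)
    src = torch.chunk(source_locations, world_size)[rank].to(rank)
    rec = torch.chunk(receiver_locations, world_size)[rank].to(rank)
    receiver_amplitudes = prop(amps, src, rec)

    # Compute the loss on this slice and call backward(); DDP then
    # averages the model gradients across all ranks automatically
    # ...

    dist.destroy_process_group()


# Launch one process per GPU:
# mp.spawn(run_rank, args=(world_size, dx, dt, freq, vp, vs, rho,
#                          source_amplitudes, source_locations,
#                          receiver_locations),
#          nprocs=world_size)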

Thank you for your guidance. My multi-GPU program now runs correctly, and the gradient is propagated correctly.

I hope this Issue is resolved, so I am going to close it. Please feel free to reopen it if you have further questions about this.