realfastvla / rfgpu

GPU-based gridding and imaging library for realfast

suggestions for multigpu use case in rfpipe

caseyjlaw opened this issue

Currently, rfpipe uses a single function to set up rfgpu on a single GPU. For realfast, that requires a one-to-one mapping between the data read and the data searched on a given GPU, which will limit GPU utilization when large amounts of memory need to be read at once.
As an example, reading L-band data with 10 ms sampling for an FRB search requires reading ~10 GB into memory. For a server with 8 GPUs, we will either need to limit the memory usage per read or share the search work for a single read across multiple GPUs.
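For a sense of where that number comes from, here is a rough back-of-envelope (all values below are assumed, illustrative ones, not an actual configuration):

# Rough, illustrative estimate of the ~10 GB read size; numbers are assumptions.
nbl = 27 * 26 // 2     # 351 baselines for 27 antennas
nchan = 1024           # assumed channel count
nint = 3500            # ~35 s of data at 10 ms sampling
nbytes = 8             # complex64 visibilities
print(nbl * nchan * nint * nbytes / 1e9, "GB")   # ~10 GB
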
I can see one way to use multiple GPUs in rfpipe, but it is pretty invasive. I wanted to ask for suggestions before implementing it. The basic flow is:

rfgpu.cudaSetDevice(devicenum)
[set up grid, etc.]
for i in resampling_list:
  for dm in dm_list:
    grid.operate()
    image.operate()
    [threshold]
return peaks

Is there overhead to calling cudaSetDevice? Could we use it within the loops to run two GPUs concurrently?

Yeah, being able to use multiple GPUs from a single host thread may be useful. According to the CUDA docs, cudaSetDevice "should be considered a very low overhead call." It might be cleaner to handle this inside rfgpu, though; for example, each Grid, Image, etc. instance will be associated with a certain device, and rfgpu will automatically switch as appropriate when each is called. That way the calling program doesn't have to manage all this. Sound good?
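In Python terms, the idea is roughly the following (just a sketch of the pattern; the real bookkeeping would live inside rfgpu's C++ classes):

import rfgpu

# Illustration only: each object remembers its device and switches to it
# before doing any work, so the caller never touches cudaSetDevice.
class DeviceBound:
    def __init__(self, device):
        self.device = device                 # the GPU this instance is tied to

    def operate(self):
        rfgpu.cudaSetDevice(self.device)     # low-overhead switch per the CUDA docs
        # ... launch this instance's kernels on self.device ...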

Yes. I was hoping it could be worked into rfgpu for the sake of the overall design. Do you think it would also be possible to make the vis_raw.h2d() command transfer to a set of devicenums?

Is there a clever way to transfer concurrently? If not, the h2d transfer time will scale up linearly with the number of devices, while the search time will scale down in the same way. When multiplexing over many GPUs, we'll eventually be spending more time moving data than searching.
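To put toy numbers on that concern (arbitrary units, not measurements):

# Toy model: if the h2d transfer is repeated serially per device while the
# search splits evenly across devices, then t_total(N) = N*t_h2d + t_search/N.
def t_total(n_gpu, t_h2d=1.0, t_search=16.0):
    return n_gpu * t_h2d + t_search / n_gpu

for n in (1, 2, 4, 8):
    print(n, t_total(n))   # improves to N=4, then the transfer term dominates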

Yes, we should be able to do concurrent transfers, no problem. Even if we couldn't, the way the Array class is currently structured there is a one-to-one mapping between GPU and host memory, so this needs changing anyway to address the overall memory usage issue mentioned earlier.

Not checked in yet but I've done some work on this. Here is a bit of info on the Array interface changes. I've tried to keep it all backwards-compatible. Let me know if you have any suggestions:

a = GPUArrayComplex((1024,1024)) allocates a 1024-by-1024 array on whatever device is currently selected (via cudaSetDevice). This is the same as it always was.

a = GPUArrayComplex((1024,1024),(1,2,3)) allocates the array on devices 1, 2 and 3.

a.h2d() sends the data from CPU memory to all devices associated with this array. The transfer is done asynchronously, so it should go to all the devices in parallel. Also, the function returns before the transfer is finished. The GPUs will handle this automatically (i.e., not start other work until it's done), but there could possibly be issues if you modify the CPU array while the transfer is still happening.

a.d2h() will transfer the data from the default device (first one in the list) to the CPU memory. a.d2h(n) will transfer from device n.
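So typical usage might look something like this (sketch only; I'm assuming here that the host-side buffer is exposed as a numpy-style .data attribute):

import numpy as np
import rfgpu

vis = rfgpu.GPUArrayComplex((1024, 1024), (1, 2, 3))       # allocate on devices 1, 2, 3
vis.data[:] = np.zeros((1024, 1024), dtype=np.complex64)   # fill the host buffer (assumed .data view)
vis.h2d()     # asynchronous copy to all three devices; returns before it finishes
# ... per-device gridding/imaging ...
vis.d2h()     # copy back from the default device (first in the list)
vis.d2h(2)    # or copy back from device 2 specifically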

I'm reviewing my current rfpipe code and see that the GPUArrayComplex object that is moved to the GPU is actually the raw visibilities (with shape (nbl, nchan, nints)). A Grid object is then defined to use that memory to downsample, dedisperse, etc. Would moving the raw data (rather than the complex grid) still be supported by the new interface?
How do you anticipate the grid object will use different devices (e.g., when running grid.operate for a given DM, integration, etc.)?

Yes, that's right. Everything I described in the previous comment has to do with moving chunks of raw data between CPU and GPU memory. In short the change allows for a single chunk of CPU memory to map into several GPUs, rather than a strictly one-to-one mapping as before.

How I plan for Grid to work is that when an instance is declared, it will take an extra argument specifying which GPU it uses, i.e., grid = rfgpu.Grid(nbl, nchan, ntime, upix, vpix, device). Then any time you call a Grid method such as operate(), it will automatically use that GPU. It will raise an error if you try to pass it an array that is not defined on that device, etc. Since each Grid instance is tied to a specific GPU, you'll need one per GPU in use.
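On the calling side that would look roughly like this (nbl, nchan, ntime, upix, vpix assumed to be defined already; the device numbers are placeholders):

devices = (0, 1)   # placeholder device numbers
grids = [rfgpu.Grid(nbl, nchan, ntime, upix, vpix, dev) for dev in devices]
# grids[i] will then run all of its methods (operate(), etc.) on devices[i]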

Does that all sound OK?

I'm sure I could work with that.
I suppose another approach would be to make the methods of Grid (e.g., downsample, operate) take a device as an argument. Perhaps that would mean only a single Grid object would be needed? Either way, multiple Array objects would be needed to manage the resampled data and images.

Having a single Grid instance able to work with different devices would take more work since there is a bunch of internal data that would then need to be duplicated on all the devices and kept track of. Implementing the scheme I described above was pretty straightforward so I went ahead and did it. This is checked in now on the "multi_gpu" branch. Give it a try and let me know if you see any issues.

It should be completely backwards compatible; I've tried some of my existing simple test cases to make sure they still work, but have not really exercised the multi-gpu capability too much. If I come up with an example script for this I will let you know.

Also, everything I said above about how Grid now works also applies to Image.

Ok, I imagined there could be structural issues. I'm excited to play with this!

Do I understand that you've only implemented the GPU memory for the GPUArrayComplex class? I tried to set it up for GPUArrayReal and got this error:

TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. rfgpu.GPUArrayReal()
    2. rfgpu.GPUArrayReal(arg0: int)
    3. rfgpu.GPUArrayReal(arg0: List[int])
    4. rfgpu.GPUArrayReal(arg0: List[int], arg1: List[int])

Invoked with: (1944, 2304), 0

Sorry, ignore that last question. I see the syntax in the code.

OK, hope that makes sense. In the "two lists" version the first is the array dimensions and the second is the list of devices. I should add some better arg names and docstrings at some point.
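For example (sketch using the dimensions from the error message above; the device list is arbitrary):

img = rfgpu.GPUArrayReal([1944, 2304])          # dims only, allocated on the current device
img = rfgpu.GPUArrayReal([1944, 2304], [0, 1])  # dims plus the list of devices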

Just trying this out a bit more today. While it seems to work (in the sense of not crashing or otherwise producing bad results), the GPUs are not being used in parallel in the way I thought they would when calling multiple devices from a single thread. So it probably needs a bit more work before it should be integrated "for real" into the pipeline.

Also, I'm adding some docstrings and argument names to the python interface, should help with code readability, for example you can do a = rfgpu.GPUArrayComplex(dims=(1024,512),devices=(1,2,3)). Some of this has been committed already, some still in progress.

I have set up some concurrent execution code around the grid/image portion of the rfgpu code. It works pretty well for 1 or 2 GPUs, but does not scale linearly beyond that.

Do you have a simple example of this? Or is it in rfpipe somewhere?

Yes, in the development branch. The concurrent part is done at:
https://github.com/realfastvla/rfpipe/blob/development/rfpipe/search.py#L137

I've restructured this a few times to try to get it to scale well. It could be simpler than this version and still scale the same (i.e. sublinear for >2 gpus).

Reading my code again, I suspect the poor scaling could be due to the use of a Python for loop inside each thread. The individual calls to grid.operate and image.operate are fast, but each time they run, the Python interpreter checks the stats object and then steps to the next integration of the for loop.
The scaling probably depends on how long it takes Python to switch context between threads and on the number of GPU images being created. If the context-switch time is comparable to the time to make a single image, it will scale poorly.
If so, then solutions would be to (1) find a way to wrap the for loop in cython or (2) move the iteration over integrations into rfgpu.
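For reference, the structure I'm describing is roughly this (simplified sketch; names and argument lists are placeholders, and the real code is at the search.py link above):

from concurrent.futures import ThreadPoolExecutor

# Each thread drives one GPU, but the loop body is Python bytecode that holds
# the GIL between the fast GPU calls.
def search_on_gpu(grid, image, dm_list, integrations):
    peaks = []
    for dm in dm_list:
        for i in integrations:
            grid.operate(...)     # args elided; fast GPU call
            image.operate(...)    # args elided; fast GPU call
            # thresholding on image.stats() happens here, in Python
    return peaks

# devices, grids, images, dm_list, integrations assumed defined as above
with ThreadPoolExecutor(max_workers=len(devices)) as ex:
    futures = [ex.submit(search_on_gpu, g, im, dm_list, integrations)
               for g, im in zip(grids, images)]
    results = [f.result() for f in futures]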

That might be the explanation. I'd like to reproduce the rfpipe usage in a simple example and run it through nvprof to really see what is going on though. I started this but haven't finished yet.

I think putting an iteration over multiple images into rfgpu is probably a good idea in any case, and is something I've thought about before. This would also allow things like using batched FFTs and should improve efficiency for small image sizes, where it looked like the current implementation is dominated by kernel launch overheads.

OK, I figured out part of the problem: I accidentally committed a version of the Makefile that had the GPU timing code enabled. This will in some circumstances slow down multi-GPU usage because it makes many of the routines wait for the GPU to finish before returning, preventing some parallel operations from happening. That may not completely explain what you were seeing, but you can pull the latest (with those two lines re-commented) and try again. It's a good idea to run make clean first to make sure everything gets rebuilt.

I have a test script set up and am able to reproduce your results; run time improves by ~50% going from 1 to 2 GPUs, and not much after that. I'll play around with it and let you know if I find out anything useful.

I've pulled the latest on the multi_gpu branch and rebuilt, but importing into Python fails:

In [1]: import rfgpu                                                                                                                                                          
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-37fd4ba54c94> in <module>
----> 1 import rfgpu

ImportError: /home/cbe-master/realfast/anaconda/envs/development3/lib/python3.6/site-packages/rfgpu.so: undefined symbol: _ZN5rfgpu5Timer4stopEv

Going back one commit (to 4c94821) fixes it.

I added Makefile as a dependency so everything will get rebuilt if it's changed, sorry for the confusion.

Just noting here that as discussed in #13 multi_gpu branch has been merged into master.

I took another look at this today. Reducing the number of cudaSetDevice and cudaGetDevice calls did not have any effect on multi-GPU performance, so I think I will leave that code as is.

I think the bad scaling is, as you say, due to Python. I made a simple change to have the Image.stats() function release the GIL, and this seems to have helped a lot. I now get linear runtime scaling for up to 4 GPUs with 1k-by-1k images (this probably depends on image size and other params). Could you test it out sometime and see if this works in your code? Also, all this GIL stuff is always semi-mysterious to me. I think this change is safe, but if you think otherwise please let me know ;)
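If it's useful, one way to sanity-check the overlap from the Python side might be something like this (hypothetical harness; the Image instances and call arguments come from your own setup):

import time
from concurrent.futures import ThreadPoolExecutor

# 'images' is assumed to be one Image instance per device; args are elided.
def run_stats(image, n=100):
    for _ in range(n):
        image.stats(...)

t0 = time.time()
with ThreadPoolExecutor(max_workers=len(images)) as ex:
    list(ex.map(run_stats, images))
print("wall time across %d GPUs: %.2f s" % (len(images), time.time() - t0))
# If the GIL is released inside stats(), this should stay close to the
# single-GPU time rather than growing with the number of GPUs.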

For additional improvement I will still think about processing batches of N images at a time, as we discussed last week. But maybe hold off on this for now in favor of working on phasing / dynamic spectra?

I built the new version (with nice new build scripts!), but can't get the scaling to improve beyond what it did before. I see 2x improvement for 2 GPUs, but none beyond that.
I have some ideas, so I'll play more on my end.