processing more than one image?

Question

acycliq opened this issue 7 months ago · comments

Dimitris Nicoloutsopoulos commented 7 months ago

Hi Brian

Just wondering, do you know by any chance if there is a way to process more than one image at the same time (or almost at the same time) in a single-GPU setup? I have a pool of threads, each one holding an image, and currently I send them to the deconvolution algorithm sequentially with a thread lock. I wonder if there is anything I could do to scale it up (assuming the GPU memory doesn't get exhausted).

As a side note, in readme.md, line 17, the link to the dask example points at the non-dask example.

Apologies if this is not the correct place to ask!

Brian Northan · Answer 1 · Tue Oct 31 2023 05:01:47 GMT+0800 (China Standard Time)

Most of the real GPU deconvolution applications I've worked on are memory limited, so I haven't looked into sending multiple deconvolutions to the GPU in parallel.

I could be wrong, but I believe you could do this by creating multiple command queues. The line of code where the command queue is created is here. When you call the python wrapper you eventually end up in the deconv3d_32f_tv function. Have you tried calling richardson_lucy_nc in parallel from two different threads? I am curious if it would work since potentially each thread could create a different command queue.

Just in case you also use cupy, I have a cupy version of non-circulant Richardson-Lucy here. It isn't as well tested as the OpenCL version, and it does not yet have TV regularization. However, it may turn out to be easier to create multiple CUDA 'streams' than it is to create multiple OpenCL 'queues'.

Dimitris Nicoloutsopoulos · Answer 2 · Tue Oct 31 2023 20:25:37 GMT+0800 (China Standard Time)

I don't know if my understanding is wrong, but this is what I have written.

Suppose I have a list of paths pointing to some images that I want to process (along with the relevant PSF). In main I split the list into chunks and create a pool of processes to process each chunk. By default I spawn as many Python processes as I have logical CPU cores. Each process has the same task to carry out, and this is set in the denoise function. Therein, I create a thread lock. That will stop threads from the same process from accessing the GPU at the same time, but I believe it will let two threads from two different processes do so: the thread lock works within a process but not across processes.

As the program below is set out right now, I create only one big chunk, so I believe all the data is handled by a single Python process, and it appears to work. However, if I split the data into chunks (by setting n_workers = 2 in the first line of main), I get memory errors.

I believe that happens because two threads are accessing my GPU at the same time. I might be mistaken though, not 100% sure. It may be that I simply run out of memory, but I don't think the image is that big.

I am going to have a look at your cupy version! Thanks a lot for the help!

EDIT 1: Moved the io.imread() calls outside the lock to benefit from the multithreading context. Images will be read and then wait in the thread pool to get picked up by the GPU when the latter becomes available.

EDIT 2: Updated the names of the images and psf

from skimage import io
from concurrent.futures import ProcessPoolExecutor
from clij2fft.richardson_lucy import richardson_lucy_nc
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing.dummy import Lock


def main(filepaths):
    # determine chunksize
    n_workers = 1  
    chunksize = max(1, round(len(filepaths) / n_workers))
    # create the process pool
    with ProcessPoolExecutor() as executor:
        # split the load operations into chunks
        for i in range(0, len(filepaths), chunksize):
            # select a chunk of filenames
            chunk = filepaths[i:(i + chunksize)]
            # submit the task
            executor.submit(denoise, chunk)
    print('Done')


def denoise(filepaths):
    """
    Reads the nd2 files using multithreading and sets a thread lock so only one thread has access to
    the gpu at any given time.
    """
    lock = Lock()
    items = [(d, lock) for d in filepaths]
    with ThreadPool() as pool:
        res = pool.starmap_async(task, items)
        res.wait()


def task(d, lock):
    im = io.imread(d[0])
    psf = io.imread(d[1])
    with lock:
        rl = richardson_lucy_nc(im, psf, 50, 0.002)
        print("processed image id: %d" % d[2])


if __name__ == '__main__':
    N = 50
    img_filepath = r"/some/path/to/Bars-G10-P15-stack.tif"
    img_filepaths = [img_filepath for _ in range(N)]

    psf_filepath = r"/some/path/to/PSF-Bars-stack.tif.tif"
    psf_filepaths = [psf_filepath for _ in range(N)]

    img_id = [d for d in range(N)]

    print("Started")
    tuples = list(zip(img_filepaths, psf_filepaths, img_id))
    main(tuples)
    print("Finished")

Dimitris Nicoloutsopoulos · Answer 3 · Thu Nov 02 2023 23:01:11 GMT+0800 (China Standard Time)

Update: The listing above appears to work when n_workers = 2. I just tried it on a bigger GPU and it runs fine. It looks like two streams are accessing the GPU in parallel. For some reason, however (which I don't understand), the total execution time barely changes:

n_workers: 1    elapsed_time: 341 secs
n_workers: 2    elapsed_time: 335 secs

Brian Northan · Answer 4 · Sat Nov 04 2023 17:46:20 GMT+0800 (China Standard Time)

I'm not surprised there wasn't much benefit in running in parallel on one GPU. I have not dived too deeply into optimizing multiple processes on the GPU. I've done a little of that on the CPU, and I remember having to dive into the computer architecture a bit (how the processors, the cores on each processor, and the cache levels were organized), then use thread affinity (binding threads to a particular CPU or core) so that the process would run on one set of 'components' rather than leaving the operating system to schedule it (potentially not optimally).
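
For reference, CPU affinity of the kind described above can be set from Python on Linux. This is a minimal sketch, assuming a platform where `os.sched_setaffinity` is available (it is not on macOS or Windows); `pin_to_cpus` is a hypothetical helper name, and it only covers the CPU side, not anything on the GPU.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the resulting set of allowed CPUs, or None when the
    platform does not expose sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs we are currently allowed to use
    target = set(cpus) & allowed        # never ask for a CPU we cannot use
    if not target:
        return allowed                  # nothing usable requested: leave as-is
    os.sched_setaffinity(0, target)     # 0 means "the calling process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(pin_to_cpus([0]))
```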

So perhaps similar optimizations could be made with CUDA streams or OpenCL queues, such as somehow specifying how many cores a stream or queue binds to.

I'm being sort of vague and "hand-wavy" here, but it probably comes down to specifying how the processes are to be run. I believe there are some optimizations that can be made as to when the data is transferred to the GPU, such that one thread is transferring data while the other is processing. I think that is one of the major features of CUDA streams.
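
The overlap idea can be illustrated on the host side without any GPU API. The sketch below is not CUDA streams; it is a plain producer/consumer pipeline, swapped in as an analogue, in which a background thread loads the next items while the main thread processes the current one. `prefetch`, `process_all`, `load` and `compute` are hypothetical names; in the listing earlier in this thread, `load` would correspond to the `io.imread` calls and `compute` to the deconvolution.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(items, load, maxsize=2):
    """Yield load(item) for each item, loading ahead in a background thread.

    maxsize bounds how many loaded items wait in memory at once.
    """
    q = queue.Queue(maxsize=maxsize)

    def producer():
        try:
            for item in items:
                q.put(load(item))   # blocks while the queue is full
        except Exception as exc:    # forward loader errors to the consumer
            q.put(exc)
        finally:
            q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        loaded = q.get()
        if loaded is _SENTINEL:
            return
        if isinstance(loaded, Exception):
            raise loaded
        yield loaded


def process_all(items, load, compute):
    """Run compute on each loaded item while the next ones are being loaded."""
    return [compute(x) for x in prefetch(items, load)]


if __name__ == "__main__":
    print(process_all([1, 2, 3], lambda x: x + 1, lambda x: x * 10))
```

This only hides loading time behind compute time; if the compute stage already saturates the GPU, the total run time is still bounded by that stage.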

Dimitris Nicoloutsopoulos · Answer 5 · Sat Nov 04 2023 21:27:38 GMT+0800 (China Standard Time)

OK, thanks a lot for your insightful comments and also for this very nice lib. If I somehow manage to get two streams processed in parallel, I will update you. But please let me know if you ever come across someone else in the community achieving this.

Dimitris Nicoloutsopoulos · Answer 6 · Tue Nov 07 2023 01:21:18 GMT+0800 (China Standard Time)

Just wanted to add this. It looks like I do send multiple streams to the GPU; the screenshot below is my nvtop while running the program I posted a few days ago, with the addition of calling the cupy-based RL deconvolution.
The screenshot shows GPU usage when I use 5 workers (with cupy, non-circulant deconvolution), and it can be seen that resources are split across those workers, so it is no wonder there is no benefit from running in parallel, as you mentioned. Maybe I haven't written my program in the best possible way.

Brian Northan · Answer 7 · Sat Jan 06 2024 22:47:49 GMT+0800 (China Standard Time)

Hi @acycliq

I added the dask code to the cupy version here.

I also modified it to support multiple GPUs, though that part still needs some testing.

Dimitris Nicoloutsopoulos · Answer 8 · Sat Jan 06 2024 23:27:17 GMT+0800 (China Standard Time)

Thanx so much Brian!