clij / clij2-fft


Runtime error

acycliq opened this issue

Many thanks for this library. I am getting an error; I imagine it is a long shot, but when I try to call richardson_lucy_nc I receive

get lib
Runtime error: ret returned -4 at /home/bnorthan/code/i2k/clij/clij2-fft/native/clij2fft/clij2fft.cpp:864
Runtime error: ret returned -4 at /home/bnorthan/code/i2k/clij/clij2-fft/native/clij2fft/clij2fft.cpp:1222

I do get a return array back, but the image looks almost black, with very low intensities.

Also, when I try richardson_lucy(img, psf, iterations, regularization_factor) (this is the total variation algorithm, right?), or the standard Richardson-Lucy (i.e. without the regularization factor), I receive

get lib
Runtime error: ret returned -4 at /home/bnorthan/code/i2k/clij/clij2-fft/native/clij2fft/clij2fft.cpp:1224

In this case the returned image looks much the same as the input image.

The Test_Installation.py script works fine.

Any idea what could have gone wrong, please? I can share the data if that helps.

Thank you!

How much memory do you have on your GPU, and what is the size of the image you are processing? Maybe it is an out-of-memory error?

If it is an out-of-memory error, you can try calling the deconvolution with dask. See this script for an example: https://github.com/clij/clij2-fft/blob/master/python/clij2fft/test_richardson_lucy_dask.py

Yes indeed! You are correct, it is a memory issue. Unfortunately I have to work with an outdated RTX 2060 with 6 GB. I have a 3D image with dimensions 60 planes by 2300×2300 px, and I tried your implementation on the middle part of it, i.e. I selected the 20 planes in the middle of the stack. It works quite nicely and it is fast.

Apologies for my ignorance, but would it make sense to make 3 thinner images, each of dimension 20 planes by 2300×2300 px, feed them to the richardson_lucy function, and then combine them at the end?

Unfortunately I wasn't able to run the non-circulant function successfully; I am getting the same error. I assume it is my GPU memory.

Many thanks!

Hi @acycliq

I just added an easier way to access the dask (chunked) version of richardson_lucy_nc; see here.

To get access to this you will have to update clij2-fft by running

pip install --upgrade clij2-fft

Note that clij2-fft recently moved from the test PyPI to the main PyPI, so the installation command is slightly different (and easier) now. You should get version 0.23 (the latest).

Then you can call

from clij2fft.richardson_lucy_dask import richardson_lucy_nc_dask

decon=richardson_lucy_nc_dask(im, psf_decon, numiterations=100, regularizationfactor=0.005, x_chunk_size=512, y_chunk_size=512, overlap=24)

where you set x_chunk_size and y_chunk_size to values small enough for your GPU to handle. Maybe try 512 and work down from there if it still doesn't work.

I don't usually chunk in z because the PSF is often elongated in z, and chunking in the z dimension can lead to artifacts.

Hi Brian

Thanks so much! In the meantime I have been experimenting on my side too. I ended up with something along these lines:

import dask.array as da
from clij2fft.richardson_lucy import richardson_lucy

chunk_size = (60, 2304 // 2, 2304 // 2)  # my image is 60 x 2304 x 2304 px
arr = da.from_array(img, chunks=chunk_size)

def deconv(arr):
    return richardson_lucy(arr, psf, 50, 0.002)

rl = arr.map_overlap(deconv, depth=(6, 6, 6), boundary='reflect', dtype='float32').compute(num_workers=1)

I will post a more complete example when I finish; maybe it will help some others in the community, but your update may make that redundant. Anyway, I am now playing around with the dask.array map_blocks function.
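For reference, reusing arr and deconv from the snippet above, my map_blocks attempt is just the one-liner below; since map_blocks applies the function per chunk with no overlap, I guess seams could show up at the chunk borders:

# same arr and deconv as above; no overlap between chunks here
rl_blocks = arr.map_blocks(deconv, dtype='float32').compute(num_workers=1)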

I didn't chunk in z either; I just split both x and y in half and my GPU processes it without problems. By the way, how can I flush my GPU? I have been experimenting with different chunk sizes, and once I send chunks that are too big to be processed, the next calls fail even though the new chunks are a lot smaller.

Also, about the num_workers in the compute call above: setting that to a higher number would only be meaningful on a multi-GPU system, do I get that correct? I tried with 2 workers and it fails.

Thanks again! I have to say it's a really nice lib, it runs so fast! It is convenient too that it doesn't require CUDA, hence it can run on a wider spectrum of configurations if I understand well (Apple M1 for example, although I'm not using one...).

If you get a chance you could add to this discussion.

I am not sure how to reset the GPU. I will have to check the OpenCL API to see if there is anything. I will also check over the C++ part of the code and make sure I am cleaning up everything after an error occurs.

My understanding is that 2 workers should use 2 GPUs, but I do not have a multi-GPU system so haven't tested this. When I get a chance I'll start up an AWS VM with multiple GPUs to test this.

In addition to running on M1, the clij2-fft implementation of RL has total-variation regularization for noise handling and non-circulant edge handling, both of which can be very important in some situations.

OK yes, sure, I will add to that post; thanks for the link.

At the moment I am flushing the GPU manually, i.e. in the terminal I do:

sudo fuser -v /dev/nvidia*

then I find the pid associated with python and i kill it

sudo kill -9 PID

(That's on Linux; I do not know about Windows...)

I asked because I was thinking of sending the full image to the deconvolution algo first (most people have better GPUs than I do...) and, if it fails, catching the failure in a try/except block and feeding the image in chunks. But I would have to flush my GPU memory after the exception. I imagine it would run faster in one go than in chunks.
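Something like this sketch is what I have in mind; deconv_with_fallback is just a hypothetical helper, and it assumes the failing call raises a Python exception (judging from the errors above it may instead just print the error, and the GPU would still need flushing afterwards):

import dask.array as da
from clij2fft.richardson_lucy import richardson_lucy

def deconv_with_fallback(img, psf, iterations=50, reg=0.002):
    # first try the whole image in one call
    try:
        return richardson_lucy(img, psf, iterations, reg)
    except Exception:
        # fall back to chunks: halve x and y, keep z whole
        chunks = (img.shape[0], img.shape[1] // 2, img.shape[2] // 2)
        arr = da.from_array(img, chunks=chunks)
        return arr.map_overlap(
            lambda block: richardson_lucy(block, psf, iterations, reg),
            depth=(0, 6, 6), boundary='reflect',
            dtype='float32').compute(num_workers=1)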

Anyway, thanks once more, you have helped me massively. I will keep this open and close it when I post my code to the discussion thread you pointed me to (I think tomorrow...).

Another approach would be to check how much memory is available on the GPU, then compute the chunk size from there. It is apparently a bit tricky to get live memory metrics with OpenCL, but we can at least get the amount of memory on the card. Something like:

import pyopencl as cl

platforms = cl.get_platforms()
devices = platforms[0].get_devices()

for device in devices:
    print(device)
    print(device.get_info(cl.device_info.GLOBAL_MEM_SIZE))

Then base the chunk size on the card memory. The NCRLTV algorithm needs approximately 9 copies of the processed image. The image that is processed will be (imagesize + psfsize/2 + overlap) in each dimension. I will probably make a small utility for this, but won't get around to it until next week.
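As a rough worked example for the image from earlier in this thread (the PSF size below is just an assumed value for illustration):

import numpy as np

gpu_bytes = 6 * 1024**3                    # RTX 2060: 6 GB
img_size = np.array((60, 2304, 2304))      # image size from earlier in the thread
psf_size = np.array((60, 64, 64))          # assumed PSF size, for illustration only
overlap = np.array((0, 24, 24))            # the overlap used in the example call above

processed = img_size + psf_size / 2 + overlap   # size of the image actually processed
needed = np.prod(processed) * 4 * 9             # float32 (4 bytes), ~9 copies
print(needed / 1024**3)                         # ~16.8 GB, far above 6 GB, so chunking is needed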

Hi Brian

Thanks for the code above about checking the memory. I have put together this one

import numpy as np
import skimage.io
from clij2fft.richardson_lucy import richardson_lucy, richardson_lucy_nc
import tifffile as tif
import dask.array as da
import pyopencl as cl


def gpu_mem():
    # return the smallest GLOBAL_MEM_SIZE across the devices of the first platform
    platforms = cl.get_platforms()
    devices = platforms[0].get_devices()
    mem_size = []
    for device in devices:
        mem_size.append(device.get_info(cl.device_info.GLOBAL_MEM_SIZE))
    return min(mem_size)

def rl_mem_footprint(img, psf, depth=(0, 0, 0)):
    img_size = np.array(img.shape)
    psf_size = np.array(psf.shape)
    depth_size = np.array(depth)

    total_size = np.prod(img_size + psf_size / 2 + depth_size)

    img_bytes = total_size * 4  # float32 needs 4 bytes
    img_bytes = img_bytes * 9  # the RL deconv algo needs 9 copies
    return np.ceil(img_bytes)

def chunk_factor(img, psf, depth):
    """
    If chunks are needed, the 3D image will be chunked along x and y only because, as
    Brian said, the psf could be elongated along z and chunking in z could create artifacts.
    I will be chunking the image in such a manner that I create 4 or 16 or 64 etc
    chunks. That means that if I split x by 2 then y will also be split by 2, which
    will result in 4 chunks with the same aspect ratio (on xy) as the original image.
    Similarly, to get 16 chunks (if needed) I will be splitting both x and y by 4.
    The image x, y dimensions are assumed to be even-numbered,
    ie they can be divided by 2. No check is done for this.
    """
    gpu_bytes = gpu_mem()
    img_bytes = rl_mem_footprint(img, psf, depth)

    cf = 1
    if gpu_bytes <= img_bytes:
        # we want to find an integer k such that:
        # img_bytes / 4^k <= gpu_bytes

        # inflate by 1 byte so that if img_bytes == gpu_bytes then chunks will be produced.
        img_bytes = img_bytes + 1

        k = np.ceil(np.emath.logn(4, img_bytes / gpu_bytes))

        # 4^k is the number of chunks.
        # To find out how much x and y must be split by, take the square root
        cf = int(np.sqrt(4 ** k))
    return cf

def decon_func(psf, algo, iterations, regularisation):
    if algo == "richardson_lucy":
        def decon(img):
            return richardson_lucy(img, psf, iterations, 0.0)
    elif algo == "richardson_lucy_tv":
        def decon(img):
            return richardson_lucy(img, psf, iterations, regularisation)
    elif algo == "richardson_lucy_nc":
        def decon(img):
            return richardson_lucy_nc(img, psf, iterations, regularisation)
    else:
        raise ValueError('Not implemented')
    return decon

def deconvolver(img, psf, algo, iterations, regularisation):
    # hardcoding the depth is maybe not a good idea. 0 for z since there is no
    # chunking there; the depth of 6 on x, y is arbitrary I think
    depth = (0, 6, 6)
    k = chunk_factor(img, psf, depth)
    print('chunk factor: %d' % k)
    assert k == 1 or k % 2 == 0, "chunking can be done in multiples of 2 only"

    chunk_size = (img.shape[0], img.shape[1] // k, img.shape[2] // k)
    print('chunk size: %s' % list(chunk_size))
    # chunked dask array
    img = da.from_array(img, chunks=chunk_size)
    print('regularisation: %s' % regularisation)
    rl = img.map_overlap(decon_func(psf, algo, iterations, regularisation),
                         depth=depth,
                         boundary='reflect',
                         dtype='float32').compute(num_workers=1)
    return rl

imgName = r"/path/to/some/image.tif"
psfName = r"/path/to/some/psf.tif"

img = skimage.io.imread(imgName)
psf = skimage.io.imread(psfName)

rl = deconvolver(img, psf, "richardson_lucy_tv", iterations=50, regularisation=0.0002)

rl[rl >= 2**16] = 2**16 - 1  # clip values >= 65536 to the uint16 maximum
tif.imwrite('richardson_lucy_uint16.tif', rl.astype(np.uint16), bigtiff=True)

When you get some time, let me know if it makes sense, if I got the memory calculations correct, etc.

Thanks

I took the utility functions you wrote and wrapped them into the code base. See #23.

Let me know if you have any other suggestions before I merge this.

Brian

Thanks for accepting them. I spotted some typos:

line 57: change 32 to 64
line 87: change wamt to want
line 96: change The find to To find out

I don't have anything more to add. You can also close the issue. I guess there is no point in adding anything to the discussion thread at the image.sc forum; your new release will cover that.