Gaussian Blur implementation is very slow

Question

Gaussian Blur implementation is very slow

EhsanKia opened this issue 5 years ago · comments

On a simple 1080x800px image, with kernel of size 10, it takes well over 15s. Not sure if there's more efficient ways to implement it, is it the same method used as the Java version?

I wonder if it's possible to use the browser's CSS filter for guassian blur, then transfer that back to a canvas.

Mark Valentino · Answer 1 · Mon Sep 25 2023 04:44:13 GMT+0800 (China Standard Time)

I looked through the code that does Gaussian blur, and the efficiency is at least O(n)^4 or larger. There is two for loops to iterate through the image, and there are two for loops to iterate through the kernel. It is possible to completely unroll the for loops, as in make them disappear, for iterating through the image kernel. That would make the efficiency at O(n)^2.

Example in Python:

gausKernel = [[1/16, 2/16, 1/16], [2/16, 4/16, 2/16], [1/16, 2/16, 1/16]]
accum = 0
data2 = copy.deepcopy(data)
for i in range(1, data.shape[0] - 1):
    for j in range(1, data.shape[1] - 1):
        accum += gausKernel[0][0] * data2[i - 1][j - 1][0]
        accum += gausKernel[0][1] * data2[i - 1][j][0]
        accum += gausKernel[0][2] * data2[i - 1][j + 1][0]
        accum += gausKernel[1][0] * data2[i][j - 1][0]
        accum += gausKernel[1][1] * data2[i][j][0]
        accum += gausKernel[1][2] * data2[i][j + 1][0]
        accum += gausKernel[2][0] * data2[i + 1][j - 1][0]
        accum += gausKernel[2][1] * data2[i + 1][j][0]
        accum += gausKernel[2][2] * data2[i + 1][j + 1][0]
        accum = round(accum)
        data2[i][j][0] = accum
        data2[i][j][1] = accum
        data2[i][j][2] = accum
        accum = 0
        
        
image3 = Image.fromarray(data2)
display(image3)

For variable kernel sizes, it would obviously make the code larger to maintain since you would need multiple functions.

Ehsan Kia · Answer 2 · Mon Sep 25 2023 08:48:50 GMT+0800 (China Standard Time)

You're not magically reducing the time complexity by unrolling the inner loops. It wasn't really O(n^4) before, it was O(w*h*k^2) and it still is after your change. But yeah this is a very slow operation you'd probably want to do on the gpu in parallel.