osnr / template-matching

πŸ”Ž Image template matching using Apple vDSP.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

template-matching

Template matching using Apple Accelerate (vDSP) API in pure C.

(TODO: see my !!con 2022 talk about this)

Implements 2D normalized cross-correlation (normxcorr2) to quickly find instances of template (patch) image on another image.

run

$ make slow && ./slow
image 1032x515, templ 62x19 -> result 1093x533 in 1.276023 sec
hit at (142, 108)
hits: 1
$ make fast && ./fast
image 1032x515, templ 62x19 -> result 1093x533 in 0.036434 sec
hit at (142, 108)
hits: 1

(also check out result.png and hits.png after each run)

sources and notes

Based on a mix of stuff (Googling normxcorr2 is a good place to start, that seems to be a nice well-defined shorthand name that means the same function when people use it):

I literally implemented it in terms of basic numpy operations and FFT first so I could understand it and compare the code to known-good code, then went and translated each line to the 2-5 equivalent lines of C.

Key things are:

  • use multiplication of the FFTs instead of naive convolution (because the kernel is really big, it's not your 3x3 or 5x5 or whatever)

  • use summed-area tables to help with normalization (some of the implementations above get lazy and use convolution with a table of ones to get sums, and that's slower)

  • use the Apple vDSP APIs, don't use for loops if you can avoid it

  • precision is important -- floats are not good enough for the running sum if you're summing over arrays of floats, you need to go 1 level higher-precision (doubles)

potential performance wins

We're still maybe 50% slower than OpenCV.

  • use real FFT instead of complex FFT. I honestly could not figure out how to do this -- the packing is so weird -- and it doesn't seem to speed it up that much (you can try just stubbing it in and seeing how long it takes to run even if you get garbage values)

  • get rid of remaining memcpys?

  • incrementality? if you have optical flow or otherwise know how an image changed

  • cache FFT of template in memory. this saves about 1/7 of runtime I think, but it complicates the code a lot right now

random thoughts

  • it would be nice to free all the buffers at the end, or have an arena allocator or something

  • it would be nice to automatically be able to view every stage of the image processing

  • there are various baseline things that it's good to know the speed of -- how long does an FFT pass take? how long does a for loop through a whole image take?

About

πŸ”Ž Image template matching using Apple vDSP.


Languages

Language:C 97.9%Language:Makefile 2.1%