Aligner

Overview

Circuit Aligner with CUDA

Rotation-invariant template matching considering GPU architecture
Providing a web frontend with Django
Available at 1/10 the price of traditional products

1 Rotation-invariant template matching considering GPU architecture

Accelerating rotation-invariant template matching with CUDA parallel algorithms while considering GPU architecture

1.1 Motivation

Red crosshairs: Characteristic parts of aligned circuits saved in the aligner
Blue crosshairs: Characteristic parts of the misaligned circuit that entered the aligner

Rotation-invariant template matching is necessary to compute the amount of rotation and translation of a circuit.

The user wants to align the misaligned circuit by rotating and translating it in two dimensions. The aligned circuit already has the characteristic patches of the aligned circuit stored. The aligner finds the same patches in the misaligned circuit and calculates the amount of rotation and translation of the circuit based on the degree of misalignment between the patches.

1.2 Template matching overview

Template matching performs the following coarse to fine search:

Coarse search: Fast search based on pixel statistics, taking advantage of the GPU architecture.
Fine search: Precise search using circular sum

1.3 Step 1: Coarse Search

Fast search based on pixel statistics, taking advantage of the GPU architecture. A coarse search is performed as follows:

Find the 1st-4th order moments of a reference patch.
Construct an ROI around a reference patch in a misaligned circuit image.
Traverse all pixels in the ROI to calculate the 1st-4th order moments. The region where the moments are calculated is centred on the pixel being traversed and has the same size as the reference patch.
Compare the reference patch moment vector to the moment vector of all patches in the ROI.
Extract candidate regions using an adaptive threshold.

1.3.1 1st to 4th Order Moments

The 1st to 4th order moments represent the mean, variance, skewness, and kurtosis of the pixels in the patch. These features are invariant to rotation or translation of the patch. For example, a patch with a mean of 3 will still have a mean of 3 if you move it slightly to the right.

1.3.2 Grid, Block, and Thread Structures and Shared Memory

The unit for performing kernels (functions) on the GPU is a grid. A grid is made up of blocks. Blocks are composed of threads. Threads within the same block can access shared memory and share data. Shared memory is very fast, so it should be actively used to improve performance.

In Step 1, the aligner's grid consists of (columns/16 x rows/16) blocks. Each block consists of (16 x 16) threads. Each thread is responsible for one patch to compute moments.

Within each block, threads share a pixel area of (16+margin x 16+margin). Due to the limitations of the Jetson Nano, it is not possible to allocate a large amount of shared memory.

For speed, all GPU memory, including shared memory, is pre-allocated and the image is uploaded.

1.4 Step 2: Fine Search

The aligner accumulates pixel values along a circular path. Each radius of the circle has a cumulative value. With n radiuses, an n-dimensional vector of accumulated values can be constructed. This vector encodes the rotation-invariant features of the image.

Find the circular sum of the reference patches.
Perform a circular sum in parallel similar to Step1 on the candidates that passed the adaptive threshold in Step1. Each thread is responsible for two circular paths.
Apply NCC between the reference and candidate vectors to pick the pixel with the maximum value.

1.4.1 Algorithms for accelerating parallel computing

As mentioned in Step 2, each thread is responsible for two circular paths. One thread is responsible for the k th circle and the n-k th circle to reduce bottlenecks between threads.

Additionally, a reduction algorithm is used to extract maxima and sums to accelerate the process.

2 Providing a web frontend with Django

Django provides a web interface. Users can access the aligner from any device, not just a PC. In addition to PC, the responsive web is also conveniently accessible on mobile.

The sorter backend and Django frontend send and receive data through shared memory. For image sharing, a reference-only object is used to minimise latency.

3 Available at 1/10 the price of traditional products

At the time the aligner was developed, previous circuit alignment PCs cost around $1,000. However, this aligner, based on Jetson Nano, costs around $100. Despite the low price, there is no reduction in performance (search time < 50ms).

4 File Description

AlignerLauncher: main()
Aligner: sets up and tears down shared memory, sets up and tears down cameras, processes images, and communicates with the web.
PatternMatching: Manages GPU memory, runs the CUDA kernel, and performs image processing using OpenCV.
CudaSupporter: Image processing, finding the moment vector and circular sum vector for each pixel using CUDA.
Grabber: Pylon camera classes, setting up and destroying cameras and grabbers
MemorySharing: Sharing memory with the web.
Webcam: Enables the webcam if there is no Pylon camera.
AlignerConsts: Constants
kbhit: checks for key presses