WeiCheng14159 / VSD_CNN_accelerator

A complete SW/HW co-design system for mask detection



Motivation

Wearing a mask and social distancing were the only measures available against COVID-19 before vaccines arrived. Moreover, emerging COVID variants have led to many breakthrough cases around the world. Wearing a mask appears to be an effective measure against COVID infection. However, enforcing this public health measure on customers is time-consuming for small businesses. We therefore propose a solution based on an ASIC CNN accelerator that can monitor compliance with mask-wearing rules automatically.

Slides

More details can be found in our presentation slides. Also, the presentation is prettier on ppt than this markdown page.

Hardware system architecture

Our mask detection CNN accelerator has the following HW architecture:

[Figure: hardware system architecture]

CPU

  • 5-stage pipeline (textbook style)
  • Implements 45 RV32I instructions
  • Direct-mapped L1-I$ and L1-D$
  • Partially implements CSR instructions (M-mode only)

AXI

  • Connects 3 masters and 7 slaves
  • AXI bridge verified by Cadence Assertion-Based Verification IP (ABVIP)

EPU (Extended Processing Unit)

  • 180 KiB weight buffer
  • 2 KiB bias buffer
  • 384 KiB output buffer
  • 384 KiB input buffer
  • 3x3/1x1 convolution, max-pooling units

RAM

  • 64KB instruction memory (IM)
  • 64KB data memory (DM)

DRAM

  • Off-chip memory simulated by testbench
  • tRP (Row Precharge time) = 5
  • tRCD (Row Address to Column Address Delay) = 5
  • CL (CAS latency) = 5
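To make the effect of these timing parameters concrete, here is a minimal sketch of a textbook open-row latency model. It assumes the three values above are in clock cycles (the README does not state a unit) and is not taken from the testbench itself. The names `DramTiming` and `read_latency` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DramTiming:
    # Values copied from the list above; the unit (clock cycles) is an assumption.
    t_rp: int = 5   # precharge time
    t_rcd: int = 5  # row address to column address delay
    cl: int = 5     # CAS latency


def read_latency(timing: DramTiming, open_row: Optional[int], target_row: int) -> int:
    """Cycles from a read request to first data under a simple open-row policy."""
    if open_row == target_row:
        # Row hit: the row is already active, only the column access is paid.
        return timing.cl
    if open_row is None:
        # Bank idle: activate the row, then read the column.
        return timing.t_rcd + timing.cl
    # Row conflict: precharge the open row, activate the new one, then read.
    return timing.t_rp + timing.t_rcd + timing.cl


timing = DramTiming()
hit = read_latency(timing, open_row=3, target_row=3)
idle = read_latency(timing, open_row=None, target_row=3)
conflict = read_latency(timing, open_row=2, target_row=3)
```

Under these assumptions a row conflict costs three times as much as a row hit, which is one reason to keep accesses to weights and feature maps sequential within a row where possible.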

ROM

  • 16 KB off-chip memory

ASIC spec

Single clock domain, speed = 100 MHz, UMC 180 nm process

| Block | Component | Area |
| --- | --- | --- |
| CPU | CPU + I$ + D$ | ~1 mm^2 |
| RAM | IM (32KB) | ~2.6 mm^2 |
| | DM (32KB) | ~2.6 mm^2 |
| EPU Buffers | Bias buffer | ~0.3 mm^2 |
| | Weight buffer | ~7.8 mm^2 |
| | Input buffer | ~16 mm^2 |
| | Output buffer | ~16 mm^2 |
| EPU | | ~0.6 mm^2 |
| Total area (SYN) | | 47.6 mm^2 |
| Total area (APR) | | 67.9 mm^2 |

[Figure: chip layout]

** The layout of this chip is only partially complete. IR drop was not considered during APR.

NN quantization & compression

  • We use an NIN (Network In Network) model [1] and apply the CLIP-Q [2] algorithm to quantize and compress the network.

  • NIN architecture in brief:

    • The NIN architecture from the original paper stacks three conv layers and one global average pooling layer.

    [Figure: NIN architecture]

  • The CLIP-Q algorithm in brief:

    • CLIP-Q combines weight pruning and quantization in a single learning framework, and performs pruning and quantization in parallel with fine-tuning. The joint pruning-quantization adapts over time with the changing network.

    [Figure: CLIP-Q algorithm overview]

    • The following example illustrates the three steps of the pruning and quantization operations for a layer with 16 weights, p = 0.25 and b = 2. The 25% of full-precision weights closest to zero are pruned because p = 0.25 (clipping), and the weights are partitioned into $2^b=4$ partitions (partitioning). Weights that fall into the same partition are averaged, and the averages become the quantization levels (quantizing). Therefore, only 4 values (for b = 2) need to be stored for each layer, and each weight can be represented by just 2 bits.

    [Figure: CLIP-Q pruning and quantization example]
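The clipping, partitioning and quantizing steps can be sketched in plain Python as follows. This is a simplified single pass over one layer, not the training-time procedure from [2] and not the code used in this repo. It assumes that p is the fraction of weights clipped to zero, that the zero level counts as one of the $2^b$ levels, and that the remaining range is split into equal-width intervals. The function name `clip_q_layer` is hypothetical.

```python
def clip_q_layer(weights, p=0.25, b=2):
    """Return (codes, levels) so that levels[codes[i]] approximates weights[i]."""
    n = len(weights)
    n_bins = 2 ** b - 1          # non-zero levels; level 0 is reserved for zero
    codes = [0] * n
    levels = [0.0] * (n_bins + 1)

    # Clipping: the fraction p of weights with the smallest magnitude becomes zero.
    n_clip = round(p * n)
    by_magnitude = sorted(range(n), key=lambda i: abs(weights[i]))
    clipped = set(by_magnitude[:n_clip])
    kept = [i for i in range(n) if i not in clipped]
    if not kept:
        return codes, levels

    # Partitioning: split the range of surviving weights into equal-width intervals.
    lo = min(weights[i] for i in kept)
    hi = max(weights[i] for i in kept)
    width = (hi - lo) / n_bins or 1.0   # avoid a zero width when all values match
    members = [[] for _ in range(n_bins)]
    for i in kept:
        k = min(int((weights[i] - lo) / width), n_bins - 1)
        codes[i] = k + 1
        members[k].append(weights[i])

    # Quantizing: each interval's level is the average of the weights inside it.
    for k, group in enumerate(members):
        if group:
            levels[k + 1] = sum(group) / len(group)
    return codes, levels


# 16 illustrative weights (made-up values), matching the layer size in the example.
w = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.05, 0.1, 0.2,
     0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, -0.02]
codes, levels = clip_q_layer(w, p=0.25, b=2)
dequantized = [levels[c] for c in codes]
```

With b = 2, each entry of `codes` fits in 2 bits, so this 16-weight layer needs 32 bits of indices plus the 4 stored levels, compared with 512 bits for 16 weights in 32-bit floating point.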

NN architecture

NIN NN model

We modified the NIN model by replacing the average pooling layer with max pooling and removing the batch normalization layers to simplify the hardware design. The modified version of NIN consists of the following layers:

[Figure: layers of the modified NIN model]

We obtain ~82% accuracy with this model on the CIFAR-10 dataset, while the original NIN model (full precision, without quantization and pruning) can reach 90% accuracy. The accuracy gap may be caused by the removal of the batch normalization layers and the replacement of the average pooling layers.
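The pooling substitution described above can be illustrated with a small sketch. The 2x2 window with stride 2 and the helper names `pool2x2` and `average` are assumptions made for illustration, since the README does not state the pooling window size. Max pooling only needs comparisons, while average pooling needs an accumulation and a division (or shift), which is one way it simplifies the hardware.

```python
def pool2x2(feature_map, reduce):
    """Apply `reduce` over non-overlapping 2x2 windows of a 2-D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reduce([feature_map[r][c], feature_map[r][c + 1],
                    feature_map[r + 1][c], feature_map[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def average(window):
    # Needs an adder tree and a divide, unlike max, which only compares.
    return sum(window) / len(window)


# Made-up 4x4 feature map for illustration.
fm = [[1, 3, 2, 0],
      [5, 7, 4, 6],
      [9, 8, 1, 1],
      [2, 2, 3, 7]]
max_pooled = pool2x2(fm, max)       # [[7, 6], [9, 7]]
avg_pooled = pool2x2(fm, average)   # [[4.0, 3.0], [5.25, 3.0]]
```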

Mask detection NN model

  • We further shrank the modified model for mask detection, keeping the NIN structure, and eventually arrived at the following model setup:

[Figure: mask detection model setup]

We then apply the CLIP-Q quantization and pruning algorithm to further compress this model so that it fits in our NN accelerator. In the end, we obtain ~82% accuracy on our custom mask-wearing dataset.

Contribution

Special thanks to @Wder4 @NCKUMaxSnake @sam2468sam @alan-chen1412 @hsiehong @GuFangYi @WeiCheng14159 for their contributions.

Reference

  • [1] Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
  • [2] Tung, F., & Mori, G. (2018). CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7873-7882).

