YOLOv2 Accelerator in Xilinx's Zynq-7000 SoC (PYNQ-z2, Zedboard and ZCU102)

A demo for accelerating YOLOv2 on Xilinx FPGAs: PYNQ-z2, Zedboard and ZCU102. I graduated from Jiangnan University, China, on July 1, 2019. Related papers are available now:
Master thesis "Research of Scalability on FPGA-based Neural Network Accelerator"
Journal article "Design and implementation of FPGA-based deep learning object detection system"
Journal article "Design and Implementation of YOLOv2 Accelerator Based on Zynq7000 FPGA Heterogeneous Platform"

For PYNQ-z2 and Zedboard, the steps are almost the same except for the final Linux application (for PYNQ, see the PYNQ directory; for Zedboard and ZCU102, see SDK and PetaLinux):

(1)Software Simulation

Firstly, you should download the darknet source from https://github.com/pjreddie/darknet and yolov2.weights from https://pjreddie.com/media/files/yolov2.weights.

Secondly, modify darknet's weight-loading function to export the weights and biases that we want. (Here, batch normalization can be folded into the weights and biases.)
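The folding of batch normalization into weights and biases can be sketched as follows. This is a minimal illustration, not the code used in this repository: the function name `fold_batch_norm` and the flattened per-channel weight layout are hypothetical, and the epsilon placement (added outside the square root) is an assumption about how darknet normalizes.

```python
import math


def fold_batch_norm(weights, bias, scale, mean, var, eps=1e-6):
    """Fold per-output-channel batch-norm parameters into conv weights and bias.

    weights: one flattened list of kernel values per output channel.
    bias, scale, mean, var: one float per output channel.
    Assumes y = scale * (conv - mean) / (sqrt(var) + eps) + bias.
    """
    folded_w, folded_b = [], []
    for w_ch, b, g, mu, v in zip(weights, bias, scale, mean, var):
        k = g / (math.sqrt(v) + eps)          # per-channel multiplier
        folded_w.append([w * k for w in w_ch])  # scale every kernel value
        folded_b.append(b - mu * k)             # absorb the mean shift
    return folded_w, folded_b
```

After folding, the accelerator only needs a plain convolution plus a bias add per output channel; the mean, variance and scale arrays are no longer required at inference time.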

Thirdly, because float-32 multiply and add operations implemented in hardware logic consume too many FPGA resources [3][6], we should use lower-precision operations instead. Here, I follow [3] and [6] and quantize the input/output feature maps, weights and biases to dynamic fixed-16, then use fixed-16 operations to replace the float-32 multiply, add and ReLU operations.
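A minimal sketch of dynamic fixed-16 quantization is shown below. It assumes a simple max-absolute-value rule for choosing the fractional length of each tensor, which is a simplification of the strategies in [3] and [6]; all function names are hypothetical and the multiply does not saturate.

```python
def choose_frac_bits(values, total_bits=16):
    """Pick the largest non-negative fractional length that keeps max |v| in range."""
    max_abs = max(abs(v) for v in values)
    int_max = 2 ** (total_bits - 1) - 1
    frac = total_bits - 1
    while frac > 0 and round(max_abs * 2 ** frac) > int_max:
        frac -= 1
    return frac


def quantize(value, frac_bits, total_bits=16):
    """Convert a float to a signed fixed-point integer, saturating on overflow."""
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return max(lo, min(hi, round(value * 2 ** frac_bits)))


def dequantize(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / 2 ** frac_bits


def fixed_mul(a, a_frac, b, b_frac, out_frac):
    """Multiply two fixed-point integers and rescale to out_frac fractional bits.

    The raw product carries a_frac + b_frac fractional bits; the shift is
    assumed to be non-negative in this sketch.
    """
    shift = a_frac + b_frac - out_frac
    return (a * b) >> shift
```

For example, a tensor whose largest magnitude is 3.0 gets 13 fractional bits, so 0.5 and 3.0 are stored as 4096 and 24576, and their fixed-point product rescaled to 13 fractional bits is 12288, which dequantizes to 1.5.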

(2)HLS Accelerator and Simulation

This part is too complicated to introduce briefly. The current design does not implement C/RTL co-simulation, because the testbench always overflows. If anyone can solve this, please tell me and upload the fix. Thanks!

(3)Vivado Block Design

Just connect the YOLOv2 IP in the Vivado Block Design. Only the Clocking Wizard configuration needs care: as I remember, the input clock is 100 MHz, the output clock is 150 MHz, and the reset pin is active low. That's all.

(4)Vivado SDK for Zedboard

This step produces the executable file that drives and controls the YOLOv2 accelerator in the PL. Here, I reserved 0x1000_0000 bytes (256 MiB) of memory for the accelerator to read/write feature maps and read weights.

(5)PetaLinux

The related steps have been updated in the Petalinux directory. Use the two files (.hdf and .bit) generated from the Vivado project to create a PetaLinux project, then test the YOLOv2 accelerator in it.

Every directory has some steps to help further implement or study this accelerator.

Design and Optimization of YOLOv2 Accelerator Based on FPGA

According to the analysis of the YOLOv2 network, most layers are serially processed, except for the routing layer. The routing layer can be implemented by setting a specific address in advance.
From the accelerator's perspective, the work is to interact with memory in order: read data from memory, process it, and write the results back. Since the amount of input and output data is very large, loop tiling is applied to reuse data and reduce the number of memory accesses; it tiles the convolution loops R, C, M, N into Tr, Tc, Tm, Tn [8].
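The loop tiling described above can be sketched in software as follows. This is an illustrative model, assuming a stride-1, unpadded convolution and a particular tile loop order; it is not the HLS code of this design, and the function name `conv_tiled` is hypothetical.

```python
def conv_tiled(ifm, weights, M, N, R, C, K, Tm, Tn, Tr, Tc):
    """Stride-1, unpadded convolution with tiled loops.

    ifm[n][row][col] has shape N x (R+K-1) x (C+K-1);
    weights[m][n][i][j] has shape M x N x K x K; the result is M x R x C.
    """
    ofm = [[[0] * C for _ in range(R)] for _ in range(M)]
    # Outer loops walk over tiles. In hardware, each iteration would load
    # one input tile and one weight tile from DRAM.
    for r0 in range(0, R, Tr):
        for c0 in range(0, C, Tc):
            for m0 in range(0, M, Tm):
                for n0 in range(0, N, Tn):
                    # Inner loops touch only one tile's worth of data,
                    # which is what the on-chip buffers would hold.
                    for m in range(m0, min(m0 + Tm, M)):
                        for n in range(n0, min(n0 + Tn, N)):
                            for r in range(r0, min(r0 + Tr, R)):
                                for c in range(c0, min(c0 + Tc, C)):
                                    for i in range(K):
                                        for j in range(K):
                                            ofm[m][r][c] += (
                                                weights[m][n][i][j]
                                                * ifm[n][r + i][c + j]
                                            )
    return ofm
```

Setting Tm=M, Tn=N, Tr=R, Tc=C reduces this to the untiled loop nest, so the two variants can be compared directly: tiling changes the order of accumulation and the data that must be resident at once, not the result.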
The overall architecture of the accelerator is shown below:

Similar to [4,5,8], the accelerator has two AXI4 master interfaces and one AXI4-Lite slave interface. The AXI4-Lite slave interface is responsible for reading and writing the control, data and status registers. The two master interfaces read the input feature maps and weights concurrently, and the output feature maps are written back at the same time through the write channel.
The Data Scatter module is designed to generate the corresponding write address and distribute the data read from the DRAM to the on-chip buffers. The Data Gather module is designed to generate the DRAM write-back address and write the data in the output buffer back to the DRAM. The other red modules are responsible for the processing of the convolutional layer (Conv and Leaky ReLU), the maximum pooling layer (Pool) and the reorg layer (Reorg).
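To illustrate the kind of address generation a scatter or gather stage performs, the sketch below lists the read bursts needed to fetch one tile from a feature map. It assumes a row-major N x H x W layout in DRAM and ignores padding and kernel halo; the function name and the layout are assumptions, not taken from this design.

```python
def tile_read_bursts(n0, r0, c0, Tn, Tr, Tc, N, H, W):
    """Return (start_index, length) bursts covering one Tn x Tr x Tc tile.

    The feature map is assumed to be stored row-major as N x H x W elements,
    so each row segment of the tile is one contiguous burst.
    """
    bursts = []
    for n in range(n0, min(n0 + Tn, N)):
        for r in range(r0, min(r0 + Tr, H)):
            length = min(c0 + Tc, W) - c0          # clip at the right edge
            bursts.append(((n * H + r) * W + c0, length))
    return bursts
```

For a 2 x 4 x 4 map, `tile_read_bursts(0, 1, 2, 1, 2, 2, 2, 4, 4)` yields two short bursts starting at element offsets 6 and 10. A tile narrower than the row width therefore splits into many short bursts, which is why the data layout matters for effective bandwidth.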

Weight Arrangement

The effective FPGA bandwidth goes up with the increase of burst length and finally flattens out above some burst length threshold[7]. The data tiling technique usually results in a discontinuous DRAM access for the row-major data layout in DRAM. To reduce the number of memory accesses and increase the effective memory bandwidth, we arrange the kernel weights for an entire tile to a continuous block to ensure a high utilization of the bandwidth of external memory [3].
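The weight rearrangement can be sketched as an offline reordering step. The example assumes weights are originally stored row-major as M x N x K x K and that tiles are visited with the output-channel tile index in the outer loop; the actual tile order in this design may differ, and `arrange_weights` is a hypothetical name.

```python
def arrange_weights(w_flat, M, N, K, Tm, Tn):
    """Reorder row-major M x N x K x K weights into tile-contiguous blocks.

    Returns the reordered list and the start offset of each (m0, n0) tile,
    so that every tile can be fetched with a single long burst.
    """
    out, tile_offsets = [], []
    for m0 in range(0, M, Tm):
        for n0 in range(0, N, Tn):
            tile_offsets.append(len(out))
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    base = (m * N + n) * K * K      # start of kernel (m, n)
                    out.extend(w_flat[base:base + K * K])
    return out, tile_offsets
```

With M=N=4, K=1 and Tm=Tn=2, the first tile gathers original indices 0, 1, 4 and 5 into one block. In the original layout those four values sit in two separate runs, so the tile would need two bursts instead of one.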

Parallel Convolution Engine

The acceleration strategy for the convolutional layer is similar to [5][6]: it exploits input and output parallelism to accelerate the computation. Multiple parallel multiplication units and adder trees provide input parallelism (Tn parallelism) and output parallelism (Tm parallelism) in the convolution calculation. The Tm*Tn multiplication units compute in parallel, and pipelined adder trees of depth log2(Tn) accumulate the products and generate the partial sums.
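A behavioural sketch of one step of such a compute array is given below. It models Tm rows of Tn multipliers, each row followed by a pairwise adder tree; it assumes Tn is a power of two and says nothing about timing or pipelining. The function names are hypothetical.

```python
def adder_tree(values):
    """Sum values by pairwise reduction; returns (total, tree_depth).

    Assumes len(values) is a power of two, so the depth is log2(len(values)).
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def pe_array_step(ifm_vec, w_tile):
    """One step of the array: Tn inputs against a Tm x Tn weight tile.

    Every row performs Tn multiplications (Tm * Tn in total), then an
    adder tree reduces each row to one partial sum per output channel.
    """
    products = [[w * x for w, x in zip(row, ifm_vec)] for row in w_tile]
    return [adder_tree(row)[0] for row in products]
```

For Tn=4 the tree has depth 2, matching log2(Tn); the partial sums returned by `pe_array_step` would then be accumulated over the kernel window and over the remaining input-channel tiles.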

Ping-Pong operation

Similar to [8], the design implements ping-pong buffers to overlap the delay of reading input feature maps and weights, writing output feature maps and calculation, which greatly improves the dynamic utilization of the computing engines.
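The effect of ping-pong buffering can be illustrated with a small schedule and an idealized latency model. The model assumes every tile has the same load, compute and store time and that the three stages never contend for the same memory port, which is optimistic for a real design; all names are hypothetical.

```python
def ping_pong_schedule(n_tiles):
    """List (load_tile, load_buf, compute_tile, compute_buf) per time step.

    Tile i is loaded into buffer i % 2 while tile i - 1 is computed
    from the other buffer, so the two never touch the same buffer.
    """
    steps = []
    for t in range(n_tiles + 1):
        load = t if t < n_tiles else None
        compute = t - 1 if t > 0 else None
        steps.append((
            load, None if load is None else load % 2,
            compute, None if compute is None else compute % 2,
        ))
    return steps


def latency_serial(n_tiles, t_load, t_compute, t_store):
    """Total time when load, compute and store run one after another."""
    return n_tiles * (t_load + t_compute + t_store)


def latency_overlapped(n_tiles, t_load, t_compute, t_store):
    """Total time when the three stages overlap across tiles.

    After the pipeline fills, each extra tile costs only the slowest stage.
    """
    bottleneck = max(t_load, t_compute, t_store)
    return t_load + t_compute + t_store + (n_tiles - 1) * bottleneck
```

With 10 tiles and stage times of 2, 5 and 1 units, the serial schedule takes 80 units while the overlapped one takes 53 under this model. The gain is largest when the three stage times are balanced.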

Evaluation

Experiments show that a floating-point addition in HLS requires three DSPs and a floating-point multiplication requires two; a fixed-point 16-bit multiplication requires one DSP, and a fixed-point 16-bit addition can be implemented using only LUTs. After place and route, the resource consumption of the fixed-16 design (Tn=2, Tm=32, Tr=26, Tc=26) and the other configurations is as follows:

Resource	DSP	BRAM	LUT	FF	Freq	Dev
INT16(n4m32) old	153(69%)	88(63%)	35977(68%)	36247(34%)	150MHz	Zedboard
FT32(n4m23) old	209(95%)	115(82%)	36348(68%)	64077(60%)	140MHz	Zedboard
INT16(n4m32) old	147(6%)	88(10%)	36759(13%)	30447(6%)	180MHz	ZCU102
FT32-(n8m28,CONV II=3,POOL II=2) default float32	259(72%)	91(42%)	31985(45%)	53728(38%)	180MHz	EdgeBoard(ZU3EG)
FT32-(n4m36,CONV II=3,POOL II=2) current float32 mp	334(93%)	109(50%)	43877(62%)	73854(52%)	150MHz	EdgeBoard(ZU3EG)

In the current design, DSP and BRAM are the most heavily used resources. The DSP cost can be reduced further (many multiplications have redundant bit width), and the BRAM cost can also be reduced. (As Shen [1] noted, HLS allocates BRAM in power-of-two sizes, so many BRAMs are actually redundant.)
The performance comparison in the two cases is shown in the following table:

Performance
CNN models	YOLO v2	YOLO v2	YOLO v2	YOLO v2	YOLO v2	YOLO v2
Board	PYNQ	Zedboard	ZCU102	Zedboard	ZU3EG	ZU3EG
Clock(MHz)	150	150	180	140	180	150
Precision	Fixed-16	Fixed-16	Fixed-16	Float-32	Float-32	Float-32
Power (W)	2.98	1.20	?	?	?	?
Operations (GOP)	29.47	29.47	29.47	29.47	29.47	29.47
Performance(GOP/s)	25.98	30.15	36.13	6.63	11.81	13.08
Power Efficiency(GOP/s/W)	4.20	6.02	?	?	?	?

New Evaluation

This section further tests the existing design and gives more details for other researchers (2023.11.06). Tools: Vivado and Vivado HLS 2019.2. The Linux app is compiled with -static -lm in release mode with -O2 optimization.

Platform:

EdgeBoard(ZU3EG): 1.2GHz A53 4 cores + 4GiB DDR4 + FPGA

ID	DataType	hls_target_clk	Tn/Tm/Tr/Tc/II_CONV/II_POOL/PP_I+W,O	DSP	BRAM	LUT	FF	Freq (MHz)	Dev	ref repo
A	FT32	3.0	4/28/26/32/3/3/1+1,1	259(72%)	90.5(42%)	31983(45%)	57683(41%)	200	EdgeBoard(ZU3EG)	02_FT32
B	FT32	3.0	4/36/26/32/3/3/4&4,2	334(93%)	109.0(50%)	44855(64%)	78699(56%)	190	EdgeBoard(ZU3EG)	02_FT32_mp_r4w2
C	INT16	3.0	8/24/26/26/1/2/1&1,1	253(70%)	88.0(41%)	50447(71%)	25249(18%)	190	EdgeBoard(ZU3EG)	02_INT16_128b
D	INT16	3.0	8/24/26/26/1/2/1+1,1	253(70%)	90.0(42%)	51296(73%)	27005(19%)	190	EdgeBoard(ZU3EG)	02_INT16_128b

*PP_I+W,O denotes the number of parallel data ports on the accelerator interface. In Design A, [1+1,1] means that the ifm and weight buffers each own an independent port (+ means or). In Design B, [4&4,2] means that the ifm and weight buffers share the same 4 ports, and the ofm buffers own 2 concurrent write-back ports.

ID	A	B	C	D
CNN models	YOLO v2	YOLO v2	YOLO v2	YOLO v2
Board	ZU3EG	ZU3EG	ZU3EG	ZU3EG
Acc-Clock(MHz)	200	190	190	190
current/available Bit_DataBus (bit)	32/128	32/128	128/128	128/128
Precision	FT32	FT32	INT16	INT16
Power (cpu idle + static fpga + dynamic cpu & fpga, W)	6.63 + 0.55 + 1.82	6.63 + 0.70 + 2.23	6.63 + 0.27 + 0.77	6.63 + 0.30 + 1.07
Operations (GOP)	29.472	29.472	29.472	29.472
Latency* (s)	2.255	1.801	0.475	0.469
Performance(GOP/s)	13.069	16.364	62.020	62.840
Power Efficiency(GOP/s/W)	5.514	5.585	59.634	45.868

*Latency does not include the post-processing stage on the CPU (e.g., the last region layer and the image saving procedure). Power Efficiency only counts the static FPGA power plus the dynamic CPU & FPGA power. CPU power could be reduced further by disabling unused modules and buses.
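The derived rows of the table can be recomputed from the measured ones. The sketch below takes the operation count, latency and power figures from the table above and assumes, as the footnote states, that the idle CPU power is excluded from the efficiency; the function names are hypothetical.

```python
def throughput_gops(operations_gop, latency_s):
    """Performance in GOP/s: total operations divided by latency."""
    return operations_gop / latency_s


def power_efficiency(gops, static_fpga_w, dynamic_w):
    """GOP/s/W using static FPGA power plus dynamic CPU & FPGA power."""
    return gops / (static_fpga_w + dynamic_w)


# Design D from the table: 29.472 GOP, 0.469 s, 0.30 W static, 1.07 W dynamic.
perf_d = throughput_gops(29.472, 0.469)
eff_d = power_efficiency(perf_d, 0.30, 1.07)
print(f"Design D: {perf_d:.3f} GOP/s, {eff_d:.3f} GOP/s/W")
```

The same two functions reproduce the Performance and Power Efficiency rows for Designs A to C to within rounding of the listed values.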

Result

References:

[1] Maximizing CNN Accelerator Efficiency Through Resource Partitioning
[2] PLACID: A Platform for FPGA-Based Accelerator Creation for DCNNs
[3] Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
[4] DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning
[5] An Automatic RTL Compiler for High-Throughput FPGA Implementation of Diverse Deep Convolutional Neural Networks
[6] A Dynamic Multi-precision Fixed-Point Data Quantization Strategy for Convolutional Neural Network
[7] Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
[8] Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

dhm2013724 / yolov2_xilinx_fpga