A parametric RTL code generator for an efficient integer NxN systolic array implementation on Xilinx FPGAs.
In a systolic array, computation proceeds rhythmically: at every clock cycle, input data is pumped in and output data is pumped out. The term "systolic" is therefore a reference to the functioning of a biological heart[1].
A number of mathematical operations can be implemented with systolic arrays; the one in this project is a weight-stationary matrix multiplier. Nowadays, systolic arrays are the architectural core of state-of-the-art neural network accelerators, such as Google's TPU[2] and Xilinx's DPU[3].
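To illustrate the weight-stationary dataflow, the following is a minimal cycle-level sketch in Python. It is not the generated RTL: the function name `systolic_matmul`, the skewing of the inputs, and the direction in which activations and partial sums travel are assumptions chosen for illustration. Each PE holds one fixed weight, activations move one PE to the right per cycle, and partial sums move one PE down per cycle.

```python
def systolic_matmul(X, W):
    """Compute Y = X @ W on a simulated N x N weight-stationary array.

    X is a T x N list of activation rows, W is an N x N weight matrix.
    PE(i, j) permanently holds W[i][j] (hypothetical mapping).
    """
    T, N = len(X), len(W)
    a_reg = [[0] * N for _ in range(N)]  # activation register of each PE
    p_reg = [[0] * N for _ in range(N)]  # partial-sum register of each PE
    Y = [[0] * N for _ in range(T)]

    # The last result leaves the array T + 2N - 2 cycles after the first input.
    for cycle in range(T + 2 * N - 2):
        new_a = [[0] * N for _ in range(N)]
        new_p = [[0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if j == 0:
                    # Row i is fed with a skew of i cycles.
                    t = cycle - i
                    a_in = X[t][i] if 0 <= t < T else 0
                else:
                    a_in = a_reg[i][j - 1]  # from the left neighbour
                p_in = p_reg[i - 1][j] if i > 0 else 0  # from the PE above
                new_a[i][j] = a_in
                new_p[i][j] = p_in + a_in * W[i][j]  # multiply-accumulate
        a_reg, p_reg = new_a, new_p

        # The bottom row emits finished sums, skewed by column index.
        for j in range(N):
            t = cycle - (N - 1) - j
            if 0 <= t < T:
                Y[t][j] = p_reg[N - 1][j]
    return Y


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because the weights never move, only activations and partial sums cross PE boundaries, which is what keeps the interconnect local and regular.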
This implementation uses an 8-bit integer representation for the inputs, which allows two multiplications to be executed simultaneously in a single DSP[4]. Furthermore, a time-multiplexing scheme is employed on the DSPs[5][6], allowing them to run twice as fast as the rest of the logic. Thus, overall, each DSP is able to execute four 8-bit integer multiplications per clock cycle. The adders responsible for accumulation are implemented with CLB[7][8] elements, such as LUTs and CARRYs.
Hence, the Processing Elements (PEs) that constitute the array are multiply-accumulate (MAC) units.
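The idea of fitting two 8-bit multiplications into one wide multiplier can be sketched arithmetically as follows. This is only an illustration of the general packing trick, not the RTL of this project: it assumes the two products share one operand (here called `w`), and the 18-bit offset, the function names, and the sign-correction step are all hypothetical choices. It also ignores the exact port widths of the DSP48 slices.

```python
SHIFT = 18  # assumed offset between the two packed operands


def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_dual_multiply(a, b, w):
    """Return (a * w, b * w) for int8 inputs using a single multiplication."""
    packed = (a << SHIFT) + b        # a * 2**18 + b
    product = packed * w             # (a * w) * 2**18 + b * w
    low = to_signed(product, SHIFT)  # b * w fits comfortably in 18 signed bits
    high = (product - low) >> SHIFT  # remove the low product, then realign
    return high, low


if __name__ == "__main__":
    print(packed_dual_multiply(-128, 127, -5))
```

Subtracting the signed low product before shifting is what compensates for the borrow that a negative `b * w` would otherwise take from the upper product.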
Given a systolic array of size NxN:
- DSPs: N² DSP48E[1[5]|2[6]] (1 for each PE)
- BRAMs: 6N RAMB18E[1[9]|2[10]] (N for each input/output matrix: ABCD,E,W,X,Y,Z)
- Operations/Cycle: 8N² (N² PEs, 2x2xMUL + 4xADD per PE)
- Frequency: Depends mostly on the target device, but can also depend on N
  - 8x8/14x14 @ XC7Z020 @ 200MHz
  - 8x8/14x14/32x32 @ XCZU9 @ 300MHz
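The resource and throughput figures listed above can be turned into a quick estimate. The sketch below simply evaluates those formulas; the function name `estimate` is hypothetical, and the peak throughput assumes that the listed frequency is the clock to which the operations-per-cycle figure refers and that every PE is busy on every cycle.

```python
def estimate(n, freq_mhz):
    """Evaluate the listed formulas for an n x n array at freq_mhz."""
    ops_per_cycle = 8 * n * n  # 4 multiplications + 4 additions per PE
    return {
        "dsps": n * n,         # one DSP48 slice per PE
        "brams": 6 * n,        # n RAMB18 blocks per input/output matrix
        "ops_per_cycle": ops_per_cycle,
        # Theoretical peak only; assumes full utilisation (see lead-in).
        "peak_gops": ops_per_cycle * freq_mhz / 1000,
    }


if __name__ == "__main__":
    for n, f in [(8, 200), (14, 200), (32, 300)]:
        print(n, f, estimate(n, f))
```

For example, a 14x14 array needs 196 DSPs and 84 BRAMs and performs 1568 operations per cycle.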
- : Relevant repository documentation.
- : Vivado project for a 2x2 array, including testbenches and a use-case scenario with AXI DMA.
- : Python script for generating RTL (edit 'settings.py', run 'main.py', import '/RTL/import_me/*').
- : OOC Vivado projects, scripts, and reports for synth/place/route of 8x8/14x14/32x32 arrays on 7000/UltraScale.
- [1]H. T. Kung et al., "Systolic Arrays (for VLSI)"
- [2]N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit"
- [3]Xilinx, "Zynq DPU Product Guide"
- [4]M. Vestias et al., "Parallel Dot-Products for Deep Learning on FPGA"
- [5]Xilinx, "7 Series DSP48E1 Slice User Guide"
- [6]Xilinx, "UltraScale Architecture DSP Slice User Guide"
- [7]Xilinx, "7 Series Configurable Logic Block User Guide"
- [8]Xilinx, "UltraScale Architecture Configurable Logic Block User Guide"
- [9]Xilinx, "7 Series FPGAs Memory Resources"
- [10]Xilinx, "UltraScale Architecture Memory Resources"