Libano's Systolic Array Generator

A parametric RTL code generator of an efficient integer MxM Systolic Array implementation for Xilinx FPGAs.

Overview

A systolic array computes in a rhythmic style: at every clock cycle, input data is pumped in and output data is pumped out. The term "systolic" is therefore a reference to the pumping of a biological heart^[1].

A number of mathematical operations can be implemented using systolic arrays, but the one in this project is a weight-stationary matrix multiplier. Nowadays, systolic arrays are the architectural core of state-of-the-art neural network accelerators, such as Google's TPU^[2] and Xilinx's DPU^[3].
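To make the weight-stationary dataflow concrete, here is a minimal behavioural sketch in Python. It is a generic textbook model, not the generated RTL: the function name `systolic_matmul`, the input skewing, and the direction in which activations and partial sums travel are all assumptions made for illustration. Each PE holds one fixed weight, activations move one PE to the right per cycle, and partial sums move one PE down per cycle.

```python
def systolic_matmul(X, W):
    """Cycle-by-cycle model of an n x n weight-stationary array computing Y = X @ W.

    X is a T x n list of lists (one activation row per sample),
    W is an n x n list of lists (PE (r, c) permanently holds W[r][c]).
    All names and the skewing scheme are illustrative assumptions.
    """
    n, T = len(W), len(X)
    a = [[0] * n for _ in range(n)]  # activation registers (flow left -> right)
    p = [[0] * n for _ in range(n)]  # partial-sum registers (flow top -> bottom)
    Y = [[0] * n for _ in range(T)]

    for t in range(T + 2 * n - 2):   # enough cycles to drain the pipeline
        new_a = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c > 0:
                    a_in = a[r][c - 1]           # from the PE on the left
                else:
                    s = t - r                    # row r is delayed by r cycles
                    a_in = X[s][r] if 0 <= s < T else 0
                p_in = p[r - 1][c] if r > 0 else 0  # from the PE above
                new_a[r][c] = a_in
                new_p[r][c] = p_in + a_in * W[r][c]  # multiply-accumulate
        a, p = new_a, new_p
        # The bottom PE of column c finishes sample s at cycle s + c + n - 1.
        for c in range(n):
            s = t - c - (n - 1)
            if 0 <= s < T:
                Y[s][c] = p[n - 1][c]
    return Y


print(systolic_matmul([[1, 2], [3, 4], [5, 6]], [[1, 2], [3, 4]]))
# prints [[7, 10], [15, 22], [23, 34]]
```

In this model a batch of T input rows takes T + 2N - 2 cycles to pass through an NxN array, so once the pipeline is full one output row completes per cycle.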

This implementation uses an 8-bit integer representation for the inputs, which allows two multiplications to be executed simultaneously in a single DSP^[4]. Furthermore, a time-multiplexing scheme is employed on the DSPs^[5]^[6], allowing them to run twice as fast as the rest of the logic. Thus, overall, each DSP is able to execute four 8-bit integer multiplications per clock cycle. The adders responsible for accumulation are implemented with CLB^[7]^[8] elements, such as LUTs and CARRYs.

Hence, the Processing Elements (PEs) that constitute the array are multiply-accumulate (MAC) units.
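The idea of executing two 8-bit multiplications in one DSP can be illustrated with plain integer arithmetic. The sketch below is an assumption-laden model, not the project's RTL: the helper name `packed_mul` and the 18-bit shift are chosen for illustration, and the exact packing used on each DSP48 variant may differ. Two signed 8-bit operands `a` and `b` that share a third operand `c` are packed into one wide word, multiplied once, and the two products are then separated again.

```python
def packed_mul(a, b, c, shift=18):
    """Compute a*c and b*c with a single wide multiplication.

    a, b, c are signed 8-bit integers (-128..127). `shift` is an
    illustrative assumption; it only needs to be wide enough to hold
    the signed product b*c (17 bits or more).
    """
    packed = (a << shift) + b          # one wide operand holding both inputs
    prod = packed * c                  # the single multiplication
    low = prod & ((1 << shift) - 1)    # lower field holds b*c modulo 2**shift
    if low >= 1 << (shift - 1):        # sign-extend the lower field
        low -= 1 << shift
    high = (prod - low) >> shift       # remove b*c before extracting a*c
    return high, low


print(packed_mul(3, -5, 7))
# prints (21, -35)
```

The subtraction before the shift matters: when `b*c` is negative it borrows from the upper field, so the upper product has to be corrected. Running such a packed multiplier at twice the fabric clock, as the time-multiplexing scheme above describes, gives the four multiplications per DSP per clock cycle.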

Resource Utilization & Performance

Given a systolic array of size NxN:

DSPs: N² DSP48E[1^[5]|2^[6]] (1 for each PE)
BRAMs: 6N RAMB18E[1^[9]|2^[10]] (N for each input/output matrix: ABCD,E,W,X,Y,Z)
Operations/Cycle: 8N² (N² PEs, 2x2xMUL + 4xADD per PE)
Frequency: Depends mostly on the target device, but can also depend on N:
- 8x8/14x14 @ XC7Z020 @ 200MHz
- 8x8/14x14/32x32 @ XCZU9 @ 300MHz
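The formulas above can be turned into a quick estimate for a given array size and clock. The sketch below is only a back-of-the-envelope helper: the function name `estimate` and the derived `peak_gops` figure are not part of the repository, and the throughput is a theoretical peak that assumes every PE does useful work on every cycle.

```python
def estimate(n, freq_mhz):
    """Resource and peak-throughput estimate for an n x n array.

    Uses the counts listed above: n^2 DSPs, 6n 18Kb BRAMs, and
    8n^2 operations per cycle. `peak_gops` is a derived upper bound,
    not a measured result.
    """
    ops_per_cycle = 8 * n * n
    return {
        "dsps": n * n,
        "bram18": 6 * n,
        "ops_per_cycle": ops_per_cycle,
        "peak_gops": ops_per_cycle * freq_mhz / 1000,  # MHz -> giga-ops/s
    }


for n, f in [(8, 200), (14, 200), (32, 300)]:
    print(n, f, estimate(n, f))
```

For example, a 14x14 array needs 196 DSPs and 84 RAMB18s, and at 200MHz its theoretical peak is 313.6 GOPS.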

Repository Organization

: Relevant repository documentation.
: Vivado project for a 2x2 array, including testbenches and a use-case scenario with AXI DMA.
: Python script for generating RTL (edit 'settings.py', run 'main.py', import '/RTL/import_me/*').
: OOC Vivado projects, scripts, and reports for synth/place/route of 8x8/14x14/32x32 arrays on 7000/UltraScale.

References

Extras
