This project covers the design and simulation of a systolic matrix multiplication kernel and its deployment on an FPGA board. Input matrices are generated on the ARM CPU of the FPGA board and sent to the systolic core, and the result is read back to the ARM CPU for validation. The project uses the PYNQ FPGA board, a user-friendly FPGA+ARM SoC (System-on-Chip) that lets you run an embedded Linux stack on the ARM CPU and use the FPGA as an accelerator. The PYNQ board also ships with user-friendly Python APIs, which we will use for programming and interacting with the FPGA.
Specific tasks are below:
We supply the input matrices m0 and m1 in row-serial and column-serial fashion.
For FPGA implementation, each input matrix is stored across multi-ported memory banks. We create banked memories using block partitioning. Each memory bank will feed data into a corresponding lane of the systolic array. We will create N banks (row-wise) for m0 to the left of the systolic array, and N banks (column-wise) for m1 at the top of the systolic array.
We will collect the result m2 of the matrix multiplication from the right side of the array.
This is shown in the figure below:
Hence, PE[0][0], PE[0][1], PE[0][2] and PE[0][3] will read m0 from Bank 0 on the left and write the result m2 elements to Bank 0 on the right. All columns of m1 will feed this row in parallel from all the banks at the top.
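The block partitioning described above can be sketched in Python as follows. This is only an illustration under assumptions: the function name block_partition and the exact bank layout (contiguous groups of M/N rows of m0, or M/N columns of m1, per bank) are hypothetical and may differ from the memory infrastructure provided with the lab.

```python
def block_partition(matrix, n_banks, by="row"):
    """Split a square M x M matrix into n_banks contiguous blocks.

    by="row": bank b holds rows    b*k .. (b+1)*k - 1 (assumed layout for m0).
    by="col": bank b holds columns b*k .. (b+1)*k - 1 (assumed layout for m1),
              each column stored as a list.
    Here k = M / n_banks.
    """
    m = len(matrix)
    if m % n_banks != 0:
        raise ValueError("matrix size M must be a multiple of the bank count N")
    k = m // n_banks
    if by == "col":
        # Transpose so that columns can be sliced like rows.
        matrix = [list(col) for col in zip(*matrix)]
    return [matrix[b * k:(b + 1) * k] for b in range(n_banks)]


# Small demo with M=4, N=2.
demo = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
for bank_id, bank in enumerate(block_partition(demo, 2, by="row")):
    print("m0 bank", bank_id, bank)
for bank_id, bank in enumerate(block_partition(demo, 2, by="col")):
    print("m1 bank", bank_id, bank)
```

With M=4 and N=2, each m0 bank holds two rows and each m1 bank holds two columns, so every lane of the array has its own memory to read from.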
The banked memory structure has been provided to you as part of the lab infrastructure. You only need to implement the ability to shift out m2 serially per lane of the systolic array into the banks. To do this, you have to make the following changes to your code:
The testbench simulates the banked memories and is primarily responsible for sending the m0 and m1 test matrix data into the array, gathering the result m2, and helping you verify correct operation. We use the $readmemh function to read hex-encoded matrix data into memories implemented in the testbench. The testbench creates an individual data stream for each memory bank to load data serially into the systolic array. At the end, the m2 memory data is used for correctness checks. The testbench abstracts the input and output ports as AXI-Streams for integration with the ARM CPU on the PYNQ board.
The testbench also generates enable signals for your counters based on the values of M and N. The testbench uses the data and valid hooks from your design to write serially into the banked memories of m2.
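For reference, $readmemh expects a text file of whitespace-separated hexadecimal words. The sketch below shows one way such per-bank data could be formatted and parsed back in Python. The helper names and the choice of D_W = 8 are assumptions for illustration; the actual file layout used by the lab's scripts is not described here.

```python
def to_readmemh_lines(words, width_bits):
    """Format unsigned integers as fixed-width hex words, one per line."""
    digits = (width_bits + 3) // 4      # hex digits needed for width_bits
    limit = 1 << width_bits
    lines = []
    for w in words:
        if not 0 <= w < limit:
            raise ValueError(f"{w} does not fit in {width_bits} bits")
        lines.append(f"{w:0{digits}x}")
    return lines


def from_readmemh_lines(lines):
    """Parse hex words back, skipping blank lines and // comments.

    Address directives (@addr) are not handled in this sketch.
    """
    words = []
    for line in lines:
        line = line.split("//")[0].strip()
        if line:
            words.extend(int(tok, 16) for tok in line.split())
    return words


# One row of m0 from the example run, assuming D_W = 8.
row = [3, 1, 8, 1, 3, 5, 6, 0]
text = to_readmemh_lines(row, 8)
print("\n".join(text))
print(from_readmemh_lines(text) == row)
```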
Run make test M=8 N=8 to view the inputs and output together and confirm that the output matches the expected result.
Matrix 0 (m0) is
[[3. 1. 8. 1. 3. 5. 6. 0.]
[3. 5. 8. 9. 0. 0. 5. 7.]
[4. 4. 4. 1. 4. 0. 9. 8.]
[1. 6. 0. 1. 8. 7. 3. 6.]
[0. 1. 1. 3. 9. 2. 4. 0.]
[8. 1. 0. 6. 8. 2. 2. 0.]
[4. 2. 0. 3. 9. 3. 3. 6.]
[8. 1. 7. 3. 1. 1. 9. 2.]]
Matrix 1 (m1) is
[[7. 9. 6. 0. 4. 6. 3. 5.]
[2. 5. 1. 8. 1. 6. 0. 4.]
[5. 5. 0. 2. 3. 6. 3. 5.]
[4. 7. 2. 4. 3. 9. 7. 5.]
[0. 1. 7. 9. 8. 0. 5. 0.]
[9. 8. 6. 0. 3. 0. 1. 7.]
[6. 7. 1. 2. 8. 8. 0. 3.]
[7. 8. 7. 1. 6. 7. 8. 7.]]
Answer is
[[148. 164. 78. 67. 127. 129. 60. 117.]
[186. 246. 95. 109. 150. 266. 152. 184.]
[170. 214. 123. 106. 187. 209. 115. 144.]
[146. 179. 157. 136. 158. 117. 105. 134.]
[ 61. 84. 86. 111. 123. 71. 71. 50.]
[112. 157. 131. 108. 137. 124. 108. 94.]
[131. 169. 158. 121. 168. 129. 129. 115.]
[182. 221. 91. 63. 158. 209. 88. 142.]]
Your answer is:
[[148. 164. 78. 67. 127. 129. 60. 117.]
[186. 246. 95. 109. 150. 266. 152. 184.]
[170. 214. 123. 106. 187. 209. 115. 144.]
[146. 179. 157. 136. 158. 117. 105. 134.]
[ 61. 84. 86. 111. 123. 71. 71. 50.]
[112. 157. 131. 108. 137. 124. 108. 94.]
[131. 169. 158. 121. 168. 129. 129. 115.]
[182. 221. 91. 63. 158. 209. 88. 142.]]
##########
Thank Mr. Goose
##########
RTL implementations of the following designs are created, with the interfaces listed below.
Processing element (PE):
clk : 1 bit input : This is the clock input to the module.
rst : 1 bit input : This is a synchronous reset signal.
init : 1 bit input : This is the init signal that flushes the accumulator.
in_a : D_W bits input : This is the first PE operand.
in_b : D_W bits input : This is the second PE operand.
out_a : D_W bits output : This is the output that streams out registered in_a.
out_b : D_W bits output : This is the output that streams out registered in_b.
in_data : 2*D_W bits input : This is the input stream of m2 matrix data.
in_valid : 1 bit input : Valid signal for in_data.
out_data : 2*D_W bits output : This is the output stream of m2 matrix data.
out_valid : 1 bit output : Valid signal for out_data.
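To make the operand dataflow through these ports concrete, here is a small behavioural sketch in Python of a grid of such PEs computing a matrix product. It is not the RTL: the class PE, its step method, and systolic_matmul are hypothetical names. It assumes that init restarts the accumulation with the current product, that the array size equals the matrix size, and it does not model the in_data/out_data shift chain for m2.

```python
class PE:
    """Behavioural model of one processing element (operand dataflow only)."""

    def __init__(self):
        self.out_a = 0   # registered copy of in_a, passed to the right
        self.out_b = 0   # registered copy of in_b, passed downwards
        self.acc = 0     # running sum of products for one m2 element

    def step(self, in_a, in_b, init=False):
        """One clock edge: register operands and update the accumulator."""
        product = in_a * in_b
        # Assumption: init discards the old sum and starts a new one.
        self.acc = product if init else self.acc + product
        self.out_a, self.out_b = in_a, in_b


def systolic_matmul(a, b):
    """Multiply two n x n matrices on an n x n grid of PEs."""
    n = len(a)
    grid = [[PE() for _ in range(n)] for _ in range(n)]
    for t in range(3 * n - 2):
        # Sample all inputs from the current register state first, so that
        # every PE updates simultaneously, as clocked hardware would.
        pending = []
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i                    # row i is delayed by i cycles
                    in_a = a[i][k] if 0 <= k < n else 0
                else:
                    in_a = grid[i][j - 1].out_a
                if i == 0:
                    k = t - j                    # column j is delayed by j cycles
                    in_b = b[k][j] if 0 <= k < n else 0
                else:
                    in_b = grid[i - 1][j].out_b
                pending.append((grid[i][j], in_a, in_b, t == i + j))
        for pe, in_a, in_b, init in pending:
            pe.step(in_a, in_b, init)
    return [[grid[i][j].acc for j in range(n)] for i in range(n)]


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this model PE[i][j] sees a[i][k] and b[k][j] in the same cycle t = i + j + k, which is why the inputs of row i and column j are skewed by i and j cycles.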
Systolic array:
clk : 1 bit input : This is the clock input to the module.
rst : 1 bit input : This is a synchronous reset signal.
enable_row_count_m0 : 1 bit input : Enable counter operation for the cascaded counter (row counter of m0).
m0 : D_W x N bits input : m0's data lanes to feed into the systolic array.
m1 : D_W x N bits input : m1's data lanes to feed into the systolic array.
column_m0 : $clog2(M) bits output : Column pointer generated by the counter associated with the m0 matrix.
row_m0 : $clog2(M/N) bits output : Row pointer generated by the counter associated with the m0 matrix.
column_m1 : $clog2(M/N) bits output : Column pointer generated by the counter associated with the m1 matrix.
row_m1 : $clog2(M) bits output : Row pointer generated by the counter associated with the m1 matrix.
m2 : 2*D_W x N bits output : Data bus for m2. Each channel is connected to a separate systolic lane.
valid_m2 : N bits output : Valid bus for m2. Each channel is connected to a separate systolic lane.
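The pointer widths above follow from each bank holding M/N rows of M elements for m0. The Python sketch below models one plausible cascaded counter for the m0 pointers. It is an assumption about the sequencing (column counter runs every cycle, row counter advances when the column counter wraps and the enable is set), and the names clog2 and m0_pointers are hypothetical.

```python
def clog2(x):
    """Ceiling of log2(x) for x >= 1, matching Verilog's $clog2."""
    return (x - 1).bit_length()


def m0_pointers(m, n, cycles, enable_row_count=True):
    """Yield (row_m0, column_m0) for each clock cycle.

    m: matrix dimension M, n: number of banks N (M must be a multiple of N).
    """
    rows_per_bank = m // n
    row = col = 0
    for _ in range(cycles):
        yield row, col
        if col == m - 1:
            col = 0
            # Cascade: the row counter only moves when the column wraps.
            if enable_row_count:
                row = (row + 1) % rows_per_bank
        else:
            col += 1


# Pointer widths and the first few pointer values for M=8, N=4.
M, N = 8, 4
print("column_m0 width:", clog2(M), "row_m0 width:", clog2(M // N))
print(list(m0_pointers(M, N, 10)))
```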
The project is simulated in a Linux environment. The server details need to be provided in the Makefile.
To compile and simulate your module, simply type make modelsim M=4 N=4. Modelsim will start in GUI mode.
If you prefer iverilog and gtkwave (OSS tools), you can use make iverilog M=4 N=4 instead.
To avoid using GUIs, you can use the make iverilog-txt M=4 N=4 or make modelsim-txt Make rules.
[Simulation] The design can be functionally tested end-to-end as a matrix multiplication operation using our Python test wrappers: use make test M=<val> N=<val>. This will first launch the iverilog tool to generate the output memory snapshot of the resultant matrix. A Python script then compares your output memory with the correct result and lets you know if your implementation is functionally correct.
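The comparison step can be pictured with the following Python sketch, which recomputes the product of the m0 and m1 matrices from the example run above and reports any differing elements. It is not the lab's checker script; the function names are hypothetical and the way the output memory snapshot is loaded is left out.

```python
M0 = [[3, 1, 8, 1, 3, 5, 6, 0],
      [3, 5, 8, 9, 0, 0, 5, 7],
      [4, 4, 4, 1, 4, 0, 9, 8],
      [1, 6, 0, 1, 8, 7, 3, 6],
      [0, 1, 1, 3, 9, 2, 4, 0],
      [8, 1, 0, 6, 8, 2, 2, 0],
      [4, 2, 0, 3, 9, 3, 3, 6],
      [8, 1, 7, 3, 1, 1, 9, 2]]

M1 = [[7, 9, 6, 0, 4, 6, 3, 5],
      [2, 5, 1, 8, 1, 6, 0, 4],
      [5, 5, 0, 2, 3, 6, 3, 5],
      [4, 7, 2, 4, 3, 9, 7, 5],
      [0, 1, 7, 9, 8, 0, 5, 0],
      [9, 8, 6, 0, 3, 0, 1, 7],
      [6, 7, 1, 2, 8, 8, 0, 3],
      [7, 8, 7, 1, 6, 7, 8, 7]]


def matmul_reference(a, b):
    """Plain software matrix product used as the golden result."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def find_mismatches(expected, actual):
    """Return (row, col, expected, actual) for every differing element."""
    return [(i, j, e, g)
            for i, (erow, arow) in enumerate(zip(expected, actual))
            for j, (e, g) in enumerate(zip(erow, arow))
            if e != g]


expected = matmul_reference(M0, M1)
# Stand-in for the simulated m2 snapshot: a copy with one corrupted element.
simulated = [row[:] for row in expected]
simulated[1][2] += 1
print(find_mismatches(expected, expected))
print(find_mismatches(expected, simulated))
```

An empty mismatch list corresponds to the case where "Your answer" equals "Answer" in the test output.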
To map the design to the board, a set of wrappers and Tcl scripts for Vivado compilation are provided. Invoke make vivado M=4 N=4 to generate the bitstream.
Once the bitstream is generated, the implementation can be tested on the board with the make board M=4 N=4 command.