google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). . . . . . Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.

Home Page:http://cfu-playground.rtfd.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

VexRiscV - SlimoptCfu does not appear to be working

alanvgreen opened this issue · comments

I have a minimal Cfu and with SlimOptCfu on HPS Proto2, the result appears to be always wrong:

/* Generated by Yosys 0.12+13 (git sha1 21fbdb663, clang 11.1.0-4+build2 -fPIC -Os) */

(* \nmigen.hierarchy  = "Cfu" *)
(* top =  1  *)
(* generator = "nMigen" *)
module Cfu(cmd_ready, cmd_payload_function_id, cmd_payload_inputs_0, cmd_payload_inputs_1, rsp_valid, rsp_ready, rsp_payload_outputs_0, reset, port0_addr, port1_addr, port2_addr, port3_addr, port0_din, port1_din, port2_din, port3_din, clk, rst, cmd_valid);
  reg \$auto$verilog_backend.cc:2083:dump_module$2  = 0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/ir.py:524" *)
  input clk;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:182" *)
  input [9:0] cmd_payload_function_id;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:184" *)
  input [31:0] cmd_payload_inputs_0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:185" *)
  input [31:0] cmd_payload_inputs_1;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:181" *)
  output cmd_ready;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:180" *)
  input cmd_valid;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/proj/hps_accel/gateware/gen2/hps_cfu.py:32" *)
  reg f = 1'h0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/proj/hps_accel/gateware/gen2/hps_cfu.py:32" *)
  reg \f$next ;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port0_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port0_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port1_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port1_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port2_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port2_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port3_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port3_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:189" *)
  input reset;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:188" *)
  output [31:0] rsp_payload_outputs_0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:187" *)
  input rsp_ready;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:186" *)
  output rsp_valid;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/ir.py:524" *)
  input rst;
  always @(posedge clk)
    f <= \f$next ;
  always @* begin
    if (\$auto$verilog_backend.cc:2083:dump_module$2 ) begin end
    \f$next  = 1'h1;
    (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/xfrm.py:519" *)
    casez (rst)
      1'h1:
          \f$next  = 1'h0;
    endcase
  end
  assign rsp_valid = 1'h1;
  assign rsp_payload_outputs_0 = 32'd305419896;
  assign cmd_ready = 1'h1;
endmodule

Every result received by the CPU is 0x13355779

This was the nmigen used to generate the Cfu.

from nmigen import Signal
from nmigen_cfu import CfuBase

from .constants import Constants

class Cfu(CfuBase):
    """Gen2 accelerator CFU.
    
    Assumes working with a slimopt+cfu VexRiscV, which rsp_ready is always true.
    """
    def elab(self, m):
        # for now, everything completes in a single cycle, so can be always ready
        m.d.comb += self.cmd_ready.eq(1)

        m.d.comb += self.rsp_out.eq(0x12345678)
        m.d.comb += self.rsp_valid.eq(1)

        f = Signal()
        m.d.sync += f.eq(1)

def make_cfu():
    return Cfu()

The 'f' signal exists to ensure that a clk signal is generated.

I haven't had time to isolate the bug better yet, but the above failing code is available at https://github.com/alanvgreen/CFU-Playground/blob/slimopt-bug/proj/hps_accel/gateware/gen2/hps_cfu.py

I have verified that the above Cfu works as expected when the default VexRiscV CPU is used.

commented

Hi @alanvgreen , did it meet timing?

commented

The CFU w/ slimopt+cfu worked correctly for me on Arty. Build command:

make  EXTRA_LITEX_ARGS="--cpu-variant=slimopt+cfu" clean prog

I'll try proto2 now.

commented

On proto2, I met timing by a ways (100MHz+), and I do see a symptom similar to yours: result is 0x12345778.

commented

And on proto2, with slim+cfu and your CFU, I get the correct 0x12345678.

commented

When I build slimopt+cfu using Radiant, I see the correct answer 0x12345678.

Ran again - confirmed it meets timing (102.87 MHz). Still producing incorrect output - see uart output below.

Hello, World!

CFU Playground
==============
 1: TfLM Models menu
 2: Functional CFU Tests
 3: Project menu
 4: Performance Counter Tests
 5: TFLite Unit Tests
 6: Benchmarks
 7: Util Tests
 8: SPI Flash Debugging
main> 3

Running Project menu

Project Menu
============
 p: Ping CFU
 v: Verify register tests
 L: test layer 20
 x: eXit to previous menu
project> v

Running Verify register tests
input = c57b29a7, output = 13355779 - FAIL
input = 2b3d0954, output = 13355779 - FAIL
input = 02354ca2, output = 13355779 - FAIL
input = 3727d2b9, output = 13355779 - FAIL
input = 7963b9c5, output = 13355779 - FAIL
input = e3c5de35, output = 13355779 - FAIL
input = f2886d5b, output = 13355779 - FAIL
input = 46f99ee6, output = 13355779 - FAIL
input = b1257a48, output = 13355779 - FAIL
input = d8184f02, output = 13355779 - FAIL
FAIL: 10 failures out of 10 tests
---
commented

I ran with slimopt+cfu, default toolchain, on proto2, with the only change that I specified the 7- grade part LIFCL-17-7UWG72C.

With this run, I matched @alanvgreen 's incorrect result exactly, 0x13355779.

commented

Some other experiments I ran --

  • changing clock divider up and down
  • changing timing fudge factor (currently 1.1) hight and lower

Results -- always a similar error, an LSB of a hex digit being 1'b1 when it should be 1'b0.

Another experiment -- I tried to reproduce on Fomu with the same CPU/CFU, and a similar tool chain -- I did not see the error with Fomu.

commented

I did another experiment with Yosys synthesis and Radiant PnR. The results are even stranger.

make PLATFORM=hps EXTRA_LITEX_ARGS="--cpu-variant=slimopt+cfu --toolchain=radiant --synth-mode=yosys" clean bitstream

Output for interactive test:

Running Run interactive tests
CFU Interactive Test:
  First operand value   > 0
  Second operand value  > 0
cfu_op0(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op1(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op2(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op3(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op4(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op5(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op6(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op7(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)

So with the CFU outputting constant 0x12345678, the output as read by the CPU alternates between two bad values -- and now we sometime see bits incorrectly forced to 1'b0 (before we only saw bits being corrupted to 1'b1). The alternation is not just random -- in all the tests, every successive result flips to the other one.

commented

I ran Radiant again full flow (2.2.1), and saw correct behavior.

commented

Another clue: with the full Oxide flow (yosys/nextpnr-nexus) on this design, I get the correct behavior if I add -noflatten to the arguments to synth_nexus in the Yosys script hps_proto2_platform.ys.

So I suspect that constant propagation / folding from the CFU constant value into the VexRiscv receiving FIFO, particularly across flop boundaries, triggers the issue. This receiving FIFO only exists in the slimopt variant.