google / CFU-Playground

Want a faster ML processor? Do it yourself! A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.

Home Page: http://cfu-playground.rtfd.io/

Performance of Renode/Verilator cosim

tcal-x opened this issue · comments

commented

We should collect some statistics on Renode/Verilator cosimulation to check whether it is close to what would be expected.

  • If the CFU were much "smaller" than the CPU, then we would expect Verilator simulation of just the CFU to require much less host time per simulated cycle than CPU-only or CPU+CFU (PLATFORM=sim) Verilator simulation. However, in some cases, such as hps_accel, the CFU is larger than the CPU, so we wouldn't expect much speedup (less than a factor of 2).

    • I say 'smaller' in quotes since we are simulating the pre-implementation Verilog, so the work of simulating a particular design might not be proportional to the implementation LUT count.
  • To get a handle on this, can we measure:

    • Host time per simulation cycle for proj_template Verilated CFU (using cosim)
    • Host time per simulation cycle for hps_accel Verilated CFU (using cosim)
    • Host time per simulation cycle for proj_template CPU+CFU (full SoC simulation using PLATFORM=sim)
    • Host time per simulation cycle for hps_accel CPU+CFU (full SoC simulation using PLATFORM=sim)
  • Also, if the CFU is active for only a small percentage of overall execution cycles, then we would expect the Renode/Verilator cosim to be much faster than full-SoC Verilator sim. But if the CFU is active for the majority of the cycles, then we wouldn't expect much speedup from this factor.
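As a rough illustration of how the four measurements above could be compared, here is a small sketch; the function name and all the numbers are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical helper: convert a wall-clock measurement into a
# host-nanoseconds-per-simulated-cycle figure, so the four
# configurations listed above can be compared directly.

def host_ns_per_cycle(wall_seconds: float, simulated_cycles: int) -> float:
    """Host nanoseconds spent per simulated cycle."""
    return wall_seconds * 1e9 / simulated_cycles

# Made-up example numbers, for illustration only:
cfu_only = host_ns_per_cycle(wall_seconds=10.0, simulated_cycles=2_000_000)
full_soc = host_ns_per_cycle(wall_seconds=80.0, simulated_cycles=2_000_000)
print(f"CFU-only: {cfu_only:.0f} ns/cycle, full SoC: {full_soc:.0f} ns/cycle")
print(f"per-cycle gain from Verilating only the CFU: {full_soc / cfu_only:.1f}x")
```

The interesting comparison is the ratio: if the CFU-only figure is close to the full-SoC figure (as suspected for hps_accel), Verilating only the CFU buys little per simulated cycle.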

Can we print out the total number of cycles that Verilator simulates in Renode/Verilator cosim? Then this can be compared directly with actual execution on the board to get the total execution cycle count. This will give us an idea of the fraction of the original execution cycles for which the CFU needs to be simulated.
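A sketch of that comparison, with made-up numbers: once Verilator reports how many cycles it simulated during cosim, dividing by the board's total cycle count gives the fraction that needs RTL simulation, and an Amdahl-style bound on the speedup cosim could deliver over full-SoC Verilator sim (assuming the Renode side is comparatively free):

```python
# Hypothetical calculation: what fraction of execution needs
# Verilator, and what is the best-case cosim speedup?

def cfu_sim_fraction(verilator_cycles: int, board_total_cycles: int) -> float:
    """Fraction of total execution cycles spent simulating the CFU."""
    return verilator_cycles / board_total_cycles

# Made-up numbers: CFU active for 5M of 100M total cycles.
f = cfu_sim_fraction(verilator_cycles=5_000_000,
                     board_total_cycles=100_000_000)

# If only fraction f of the cycles pay Verilator cost and the rest run
# at (much faster) Renode speed, the best case is roughly 1/f.
print(f"CFU-active fraction: {f:.2%}; best-case cosim speedup ~{1 / f:.0f}x")
```

This matches the earlier observation: a CFU active for most of the cycles caps the speedup near 1x regardless of how fast the Renode side is.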

Finally, is the Verilator-generated C++ always generated so that it's capable of dumping waveforms? We should see how much faster simulation is if we disable waveform generation in the generated code.
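For reference, whether the generated C++ contains waveform-dump code at all is decided at Verilation time: Verilator only emits tracing support when passed `--trace` (the generated code is then guarded by `VM_TRACE`). A hedged sketch of the two invocations; the actual flags used by CFU-Playground's build may differ:

```shell
# With tracing support compiled in (model can dump a VCD; carries some
# bookkeeping overhead even when no waveform file is opened):
verilator --cc --trace cfu.v

# Without --trace, no tracing code is generated, which is the simplest
# way to measure the simulation-speed cost of waveform support:
verilator --cc cfu.v
```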

FYI @alanvgreen

I think wall time to run tests is probably a much simpler thing to examine?

With Renode+CFU you should also be able to use multiple CPUs at once (at least one for Renode, one for Verilator).

commented

In chat, @PiotrZierhoffer suggested reducing the value on this line: https://github.com/google/CFU-Playground/blob/main/scripts/generate_renode_scripts.py#L91.

I tried it, measuring wall clock time for one hps_accel inference. Baseline time was 1:22. Reducing the value by 1000x reduced time to 1:19. Reducing it by another 1000x reduced the time to 1:17.

We are also about to release changes that get rid of the ticking completely. They passed the internal review already, so we're getting there soon.

@mithro the execution is not really parallel here - as we execute a single instruction, we let the CFU calculate everything and then return to main Renode.

@tcal-x @mithro latest changes by @robertszczepanski , pulled in with #301 , should already improve the performance here.

I'm finding the cosim very slow with hps_accel and GATEWARE_GEN=2. I think this is because there is quite a bit of free-running logic in the gen2 gateware.