google / CFU-Playground

Want a faster ML processor? Do it yourself! A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.

Home Page: http://cfu-playground.rtfd.io/

Performance of Renode/Verilator cosim

tcal-x opened this issue · comments

commented

We should collect some statistics on Renode/Verilator cosimulation to check whether it is close to what would be expected.

  • If the CFU were much "smaller" than the CPU, then we would expect Verilator simulation of just the CFU to require much less host time per simulated cycle than CPU-only or CPU+CFU (PLATFORM=sim) Verilator simulation. However, in some cases, such as hps_accel, the CFU is larger than the CPU, so we wouldn't expect much speedup (less than a factor of 2).

    • I say 'smaller' in quotes since we are simulating the pre-implementation Verilog, so the work of simulating a particular design might not be proportional to the implementation LUT count.
  • To get a handle on this, can we measure:

    • Host time per simulation cycle for proj_template Verilated CFU (using cosim)
    • Host time per simulation cycle for hps_accel Verilated CFU (using cosim)
    • Host time per simulation cycle for proj_template CPU+CFU (full SoC simulation using PLATFORM=sim)
    • Host time per simulation cycle for hps_accel CPU+CFU (full SoC simulation using PLATFORM=sim)
  • Also, if the CFU is active for only a small percentage of overall execution cycles, then we would expect the Renode/Verilator cosim to be much faster than full-SoC Verilator sim. But if the CFU is active for the majority of the cycles, then we wouldn't expect much speedup from this factor.
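As a rough illustration of how the four measurements above could be compared, here is a small sketch; the function name and all the numbers are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical helper: convert a wall-clock measurement into a
# host-nanoseconds-per-simulated-cycle figure, so the four
# configurations listed above can be compared directly.

def host_ns_per_cycle(wall_seconds: float, simulated_cycles: int) -> float:
    """Host nanoseconds spent per simulated cycle."""
    return wall_seconds * 1e9 / simulated_cycles

# Made-up example numbers, for illustration only:
cfu_only = host_ns_per_cycle(wall_seconds=10.0, simulated_cycles=2_000_000)
full_soc = host_ns_per_cycle(wall_seconds=80.0, simulated_cycles=2_000_000)
print(f"CFU-only: {cfu_only:.0f} ns/cycle, full SoC: {full_soc:.0f} ns/cycle")
print(f"per-cycle gain from Verilating only the CFU: {full_soc / cfu_only:.1f}x")
```

The interesting comparison is the ratio: if the CFU-only figure is close to the full-SoC figure (as suspected for hps_accel), Verilating only the CFU buys little per simulated cycle.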

Can we print out the total number of cycles that Verilator simulates in Renode/Verilator cosim? Then this can be compared directly with actual execution on the board to get the total execution cycle count. This will give us an idea of the fraction of the original execution cycles for which the CFU needs to be simulated.
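A sketch of that comparison, with made-up numbers: once Verilator reports how many cycles it simulated during cosim, dividing by the board's total cycle count gives the fraction that needs RTL simulation, and an Amdahl-style bound on the speedup cosim could deliver over full-SoC Verilator sim (assuming the Renode side is comparatively free):

```python
# Hypothetical calculation: what fraction of execution needs
# Verilator, and what is the best-case cosim speedup?

def cfu_sim_fraction(verilator_cycles: int, board_total_cycles: int) -> float:
    """Fraction of total execution cycles spent simulating the CFU."""
    return verilator_cycles / board_total_cycles

# Made-up numbers: CFU active for 5M of 100M total cycles.
f = cfu_sim_fraction(verilator_cycles=5_000_000,
                     board_total_cycles=100_000_000)

# If only fraction f of the cycles pay Verilator cost and the rest run
# at (much faster) Renode speed, the best case is roughly 1/f.
print(f"CFU-active fraction: {f:.2%}; best-case cosim speedup ~{1 / f:.0f}x")
```

This matches the earlier observation: a CFU active for most of the cycles caps the speedup near 1x regardless of how fast the Renode side is.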

Finally, is the Verilator-generated C++ always generated so that it's capable of dumping waveforms? We should see how much faster simulation is if we disable waveform generation in the generated code.
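For reference, whether the generated C++ contains waveform-dump code at all is decided at Verilation time: Verilator only emits tracing support when passed `--trace` (the generated code is then guarded by `VM_TRACE`). A hedged sketch of the two invocations; the actual flags used by CFU-Playground's build may differ:

```shell
# With tracing support compiled in (model can dump a VCD; carries some
# bookkeeping overhead even when no waveform file is opened):
verilator --cc --trace cfu.v

# Without --trace, no tracing code is generated, which is the simplest
# way to measure the simulation-speed cost of waveform support:
verilator --cc cfu.v
```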

FYI @alanvgreen

I think wall time to run tests is probably a much simpler thing to examine?

With Renode+CFU you should also be able to use multiple CPUs at once (at least one for Renode, one for Verilator).

commented

In chat, @PiotrZierhoffer suggested reducing the value on this line: https://github.com/google/CFU-Playground/blob/main/scripts/generate_renode_scripts.py#L91.

I tried it, measuring wall clock time for one hps_accel inference. Baseline time was 1:22. Reducing the value by 1000x reduced time to 1:19. Reducing it by another 1000x reduced the time to 1:17.

We are also about to release changes that get rid of the ticking completely. They passed the internal review already, so we're getting there soon.

@mithro the execution is not really parallel here - as we execute a single instruction, we let the CFU calculate everything and then return to main Renode.

@tcal-x @mithro latest changes by @robertszczepanski , pulled in with #301 , should already improve the performance here.

I'm finding the cosim very slow with hps_accel and GATEWARE_GEN=2. I think this is because there is quite a bit of free-running logic in the gen2 gateware.