enjoy-digital / litex

Build your hardware, easily!

RocketChip with L2 cache - works in litex sim, works on Digilent Nexys Video when cpu-mem-width is 64 bits, breaks with 128

n-kremeris opened this issue · comments

Hi All!

I'm trying to integrate the L2 InclusiveCache from ChipsAlliance (https://github.com/chipsalliance/rocket-chip-inclusive-cache) with a single Rocket core to be used inside the Litex SoC.

I can confirm that it boots fine when used with Litex sim when the rocket is regenerated with the following configurations:

class LitexConfig_linux_1_1 extends Config(
  new WithNBigCores(1) ++
  new WithEdgeDataBits(64) ++
  new WithInclusiveCache() ++
  new BaseLitexConfig
)

class LitexConfig_linux_1_2 extends Config(
  new WithNBigCores(1) ++
  new WithEdgeDataBits(128) ++
  new WithInclusiveCache() ++
  new BaseLitexConfig
)

And below are the cmdlines used to launch Litex sim:

  • litex_sim --with-sdram --sdram-data-width 64 --cpu-type rocket --cpu-variant linux --cpu-num-cores 1 --cpu-mem-width 1 --jobs 12 --threads 12
  • litex_sim --with-sdram --sdram-data-width 128 --cpu-type rocket --cpu-variant linux --cpu-num-cores 1 --cpu-mem-width 2 --jobs 12 --threads 12

Both options boot to the BIOS without issues and successfully pass the built-in memory test.
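For reference, here is a minimal sketch of how the --cpu-mem-width factor relates to the Rocket MEM port width, based only on the values mentioned in this thread (factor 1 = 64 bits, 2 = 128 bits, 8 = 512 bits). The helper names are hypothetical and are not part of LiteX:

```python
# Hypothetical helpers, not LiteX API. Assumes the MEM port width is
# 64 bits times the --cpu-mem-width factor, as the numbers in this
# thread suggest (1 -> 64, 2 -> 128, 8 -> 512).
ROCKET_BASE_MEM_BITS = 64


def mem_port_bits(cpu_mem_width: int) -> int:
    """Return the Rocket MEM port width in bits for a given factor."""
    if cpu_mem_width < 1:
        raise ValueError("cpu_mem_width must be a positive integer")
    return ROCKET_BASE_MEM_BITS * cpu_mem_width


def needs_width_adapter(cpu_mem_width: int, dram_native_bits: int) -> bool:
    """True if the MEM port and the DRAM-side port differ in width."""
    return mem_port_bits(cpu_mem_width) != dram_native_bits


if __name__ == "__main__":
    # The two litex_sim invocations above: widths match, no adapter.
    for factor, sdram_bits in ((1, 64), (2, 128)):
        print(factor, mem_port_bits(factor),
              needs_width_adapter(factor, sdram_bits))
```

With a 128-bit DRAM-side port, factor 1 is the case where an adapter would be needed and factor 2 is the case where it would not.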

Additionally, I have tried running a tiny bare-metal program using the internal Verilator-based Rocket Chip simulator (rocket-chip/emulator), and that also works with both the 64-bit and the 128-bit memory configurations (the application is loaded into the main_ram area).

The design with the L2 cache also works when synthesized for a real FPGA, targeting the Digilent Nexys Video board, but only with --cpu-mem-width 1. That setting implicitly makes the LiteX SoC builder generate a memory width adapter from 64 bits to 128 bits for LiteDRAM, as the Nexys Video uses a 128-bit wide memory bus. Below is the command used to build this version of the bitstream:

./litex-boards/litex_boards/targets/digilent_nexys_video.py --build --cpu-type rocket --cpu-variant linux --cpu-num-cores 1 --cpu-mem-width 1 --sys-clk-freq 50e6 --with-ethernet --bus-data-width 64 --bus-address-width 32 --csr-csv ./csr.csv

 LiteX git sha1: 1520d0f3                                     
                                                              
--=============== SoC ==================--                    
CPU:            RocketRV64[imac] @ 50MHz                      
BUS:            WISHBONE 64-bit @ 4GiB                        
CSR:            32-bit data                                   
ROM:            128.0KiB                                      
SRAM:           8.0KiB
SDRAM:          512.0MiB 16-bit @ 400MT/s (CL-7 CWL-5)
MAIN-RAM:       512.0MiB

--========== Initialization ============--
Ethernet init...
Initializing SDRAM @0x80000000...
Switching SDRAM to software control.
Read leveling:
  m0, b00: |00000000000000000000000000000000| delays: -
  m0, b01: |01111111111111111111111111111100| delays: 15+-14
  m0, b02: |00000000000000000000000000000000| delays: -
  m0, b03: |00000000000000000000000000000000| delays: -
  m0, b04: |00000000000000000000000000000000| delays: -
  m0, b05: |00000000000000000000000000000000| delays: -
  m0, b06: |00000000000000000000000000000000| delays: -
  m0, b07: |00000000000000000000000000000000| delays: -
  best: m0, b01 delays: 15+-14
  m1, b00: |00000000000000000000000000000000| delays: -
  m1, b01: |01111111111111111111111111111100| delays: 15+-14
  m1, b02: |00000000000000000000000000000000| delays: -
  m1, b03: |00000000000000000000000000000000| delays: -
  m1, b04: |00000000000000000000000000000000| delays: -
  m1, b05: |00000000000000000000000000000000| delays: -
  m1, b06: |00000000000000000000000000000000| delays: -
  m1, b07: |00000000000000000000000000000000| delays: -
  best: m1, b01 delays: 15+-14
Switching SDRAM to hardware control.
Memtest at 0x80000000 (2.0MiB)...
  Write: 0x80000000-0x80200000 2.0MiB     
   Read: 0x80000000-0x80200000 2.0MiB     
Memtest OK
Memspeed at 0x80000000 (Sequential, 2.0MiB)...
  Write speed: 38.1MiB/s
   Read speed: 46.6MiB/s

--============== Boot ==================--

However, when using the Rocket variant with 128 bit memory bus width with --cpu-mem-width 2, the bios memtest hangs upon the first write attempt and does not proceed, as can be seen below:

./litex-boards/litex_boards/targets/digilent_nexys_video.py --build --cpu-type rocket --cpu-variant linux --cpu-num-cores 1 --cpu-mem-width 2 --sys-clk-freq 50e6 --with-ethernet --bus-data-width 64 --bus-address-width 32 --csr-csv ./csr.csv
 LiteX git sha1: 1520d0f3

--=============== SoC ==================--
CPU:            RocketRV64[imac] @ 50MHz
BUS:            WISHBONE 64-bit @ 4GiB
CSR:            32-bit data
ROM:            128.0KiB
SRAM:           8.0KiB
SDRAM:          512.0MiB 16-bit @ 400MT/s (CL-7 CWL-5)
MAIN-RAM:       512.0MiB

--========== Initialization ============--
Ethernet init...
Initializing SDRAM @0x80000000...
Switching SDRAM to software control.
Read leveling:
  m0, b00: |00000000000000000000000000000000| delays: -
  m0, b01: |01111111111111111111111111111100| delays: 15+-14
  m0, b02: |00000000000000000000000000000000| delays: -
  m0, b03: |00000000000000000000000000000000| delays: -
  m0, b04: |00000000000000000000000000000000| delays: -
  m0, b05: |00000000000000000000000000000000| delays: -
  m0, b06: |00000000000000000000000000000000| delays: -
  m0, b07: |00000000000000000000000000000000| delays: -
  best: m0, b01 delays: 15+-14
  m1, b00: |00000000000000000000000000000000| delays: -
  m1, b01: |00111111111111111111111111111100| delays: 15+-13
  m1, b02: |00000000000000000000000000000000| delays: -
  m1, b03: |00000000000000000000000000000000| delays: -
  m1, b04: |00000000000000000000000000000000| delays: -
  m1, b05: |00000000000000000000000000000000| delays: -
  m1, b06: |00000000000000000000000000000000| delays: -
  m1, b07: |00000000000000000000000000000000| delays: -
  best: m1, b01 delays: 15+-13
Switching SDRAM to hardware control.
Memtest at 0x80000000 (2.0MiB)...
  Write: 0x80000000-0x80000000 0B   

Since both litex_sim and Rocket's internal simulator work with the L2 cache at a memory width of either 64 or 128 bits, I assume something strange is happening on the LiteX side.

I would really appreciate some advice on how to narrow down where the problem is, as I would like to avoid having to use the memory width adaptor. Thanks in advance!
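As a small aid for comparing captured logs, the following sketch parses the memtest progress lines shown above and reports how many bytes were covered. It assumes the line format seen in these two logs ("Write: 0xSTART-0xEND ..."). The function name is hypothetical and this is not part of the LiteX BIOS or tooling:

```python
import re

# Matches lines such as "  Write: 0x80000000-0x80200000 2.0MiB"
# (format taken from the logs in this post; an assumption otherwise).
PROGRESS_RE = re.compile(
    r"^\s*(Write|Read):\s+0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)"
)


def memtest_progress(line):
    """Return (operation, bytes_covered), or None for unrelated lines."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    start = int(match.group(2), 16)
    end = int(match.group(3), 16)
    return match.group(1), end - start


if __name__ == "__main__":
    good = "  Write: 0x80000000-0x80200000 2.0MiB     "
    hung = "  Write: 0x80000000-0x80000000 0B   "
    print(memtest_progress(good))  # full 2 MiB range covered
    print(memtest_progress(hung))  # zero bytes: stuck on first write
```

A result of zero bytes on the last captured Write line corresponds to the hang described above, where the very first write never completes.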

commented

I have a very stupid, basic question: how did you "connect" the L2 InclusiveCache and RocketChip sources? I.e., where did you copy the rocket-chip-inclusive-cache git repo w.r.t. the rocket-chip repo, and what (if any) files did you edit to "link" them, before adding "WithInclusiveCache()" to your chosen class in rocket-chip/src/main/scala/system/Configs.scala and running the Verilog generation/elaboration, presumably via make ... CONFIG=... ?

I added it loosely following Chipyard's examples. I checked out the inclusive cache repository in rocket-chip/src/main/scala/rocket-chip-inclusive-cache, and added the WithInclusiveCache() mixin to the relevant LiteX configurations inside system/Configs.scala, as shown in the first post:

class LitexConfig_linux_1_1 extends Config(
  new WithNBigCores(1) ++
  new WithEdgeDataBits(64) ++
  new WithInclusiveCache() ++
  new BaseLitexConfig
)

class LitexConfig_linux_1_2 extends Config(
  new WithNBigCores(1) ++
  new WithEdgeDataBits(128) ++
  new WithInclusiveCache() ++
  new BaseLitexConfig
)

To be sure, I also deleted the generated-src folder that is included in pythondata-cpu-rocket.
I can confirm that the L2 cache is definitely included, both when building the internal Rocket Chip emulator and when running the Verilog regeneration in LiteX's update.sh: the compilation output shows the L2 connections being made, the L2 appears in the generated internal DTS and in the Verilog, and flushing via the flush register works. From my understanding, no additional modifications should be required for the cache to work (it should be able to directly replace Rocket Chip's L2 coherence Broadcast Hub).

To change the configuration, I edited the top-level Makefrag file (line 9: CONFIG ?= $(CFG_PROJECT).DefaultConfig). Alternatively, as you mentioned, the configuration can be passed directly on the make command line.

I did not have to edit any other files inside the RocketChip sources.

Something to note is that without the L2 cache, targeting digilent nexys video, both cpu-mem-width 1 and 2 work fine.
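To make the pattern explicit, the observations reported so far can be tabulated and checked against a candidate explanation. This is only a sketch of the reasoning: the record type and target labels are hypothetical, and the rows are transcribed from the results described in this thread up to this point:

```python
from typing import Callable, List, NamedTuple


class Run(NamedTuple):
    target: str     # where the design was run (labels are illustrative)
    l2: bool        # inclusive L2 cache present
    mem_bits: int   # Rocket MEM port width in bits
    ok: bool        # passed memtest / ran correctly


RUNS: List[Run] = [
    Run("litex_sim", True, 64, True),
    Run("litex_sim", True, 128, True),
    Run("rocket_emulator", True, 64, True),
    Run("rocket_emulator", True, 128, True),
    Run("nexys_video", True, 64, True),
    Run("nexys_video", True, 128, False),   # hangs on first write
    Run("nexys_video", False, 64, True),
    Run("nexys_video", False, 128, True),
]


def explains_failures(runs: List[Run],
                      predicate: Callable[[Run], bool]) -> bool:
    """True if the predicate holds on exactly the failing runs."""
    return all(predicate(r) == (not r.ok) for r in runs)


if __name__ == "__main__":
    def on_fpga_with_wide_l2(r: Run) -> bool:
        return r.target == "nexys_video" and r.l2 and r.mem_bits > 64

    print(explains_failures(RUNS, on_fpga_with_wide_l2))
```

Only the combination of real hardware, L2 enabled, and a MEM port wider than 64 bits matches the failing run. Neither "L2 with a wide port" alone nor "wide port on the FPGA" alone is sufficient, given the simulator and no-L2 results.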

commented

I tried 4-cores and mem-width 2 (128 bit width) on my nexys-video board, and it did hang during memtest. However, it also failed to pass timing at the requested 50MHz frequency. Trying 2-cores now (compilation takes a while), and will try 1 core after that.

But it'd be interesting to know if your build passed or failed timing as well. If timing fails, we have no way to tell with any sort of confidence whether anything related to the L2 cache is or isn't buggy... :)

EDIT: I will then try using my sitlinv-stlv7325 (v2) board; it can run L2-less rocket cores at 100MHz, so maybe downgrading to 50MHz will allow for enough headroom to pass timing closure, and get some useful test results. Can't use it with e.g. sata (yet), but it should be good for this test :)

commented

@n-kremeris : I managed to test on the stlv7325, where at 75MHz we're fast enough for the DRAM to work at all (i.e., in the L2-cache-less version), and we can pass Vivado timing for the variant where L2-inclusive-cache is enabled.

I could reproduce your observation (having wider-than-64 MEM ports and L2-cache enabled will result in a hard hang any time we try to access any memory, but doing external-to-rocket MEM port width adaptation works).

IOW, L2-cache enabled rocket is happy as long as its MEM port isn't changed from the 64-bit default width.
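Conceptually, external width adaptation just means that several narrow beats on the 64-bit MEM port are combined into one wide DRAM-side word, and split again on the way back. The sketch below models only that idea in software. It is not LiteX's converter or anything from Rocket, the function names are hypothetical, and the beat ordering (first beat in the low-order bits) is an assumption made for illustration:

```python
from typing import List


def pack_beats(beats: List[int], beat_bits: int = 64) -> int:
    """Combine narrow beats into one wide word, first beat in the low bits."""
    word = 0
    for index, beat in enumerate(beats):
        if not 0 <= beat < (1 << beat_bits):
            raise ValueError("beat does not fit in beat_bits")
        word |= beat << (index * beat_bits)
    return word


def unpack_word(word: int, n_beats: int, beat_bits: int = 64) -> List[int]:
    """Split a wide word back into n_beats narrow beats."""
    mask = (1 << beat_bits) - 1
    return [(word >> (i * beat_bits)) & mask for i in range(n_beats)]


if __name__ == "__main__":
    # Two 64-bit beats form one 128-bit word, and round-trip unchanged.
    beats = [0x1122334455667788, 0x99AABBCCDDEEFF00]
    wide = pack_beats(beats)
    print(hex(wide))
    print(unpack_word(wide, 2) == beats)
```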

I've opened chipsalliance/rocket-chip-inclusive-cache#25 to request confirmation of my hypothesis that the inclusive-cache assumes default mem port width, i.e., that this is a bug (well, rather an oversight) that fails to account for use cases where one would want Rocket's externally visible port to be wider than the default...

In the meantime, I should learn to read Chisel... :D

commented

So, I decided to do some measurements:

  1. Comparing a 64-bit wide MEM port, with vs. without the L2 inclusive-cache:

     utilization         with-L2          without-L2
     LUT as logic        83345 (40.90%)   78580 (38.56%)
     Reg. as flip-flop   42641 (10.46%)   39983 ( 9.81%)
     BRAM                  198 (44.61%)      62 (14.04%)

  2. Comparing an 8x-wide (512-bit) MEM port vs. a 64-bit MEM port (the latter using the LiteX-provided width adapter to the 512-bit wide LiteDRAM):

     utilization         8x internal      8x via litex
     LUT as logic        81831 (40.15%)   78580 (38.56%)
     Reg. as flip-flop   43663 (10.71%)   39983 ( 9.81%)
     BRAM *                 62 (14.04%)      62 (14.04%)

     * no difference, since both have same (no) L2 cache

As it turns out, having LiteX do the MEM <-> LiteDRAM width adaptation results in fewer resources (LUT, FF) being utilized as compared to when we have the width "conversion" internal to Rocket.
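The differences can be worked out directly from the reported utilization figures. This sketch simply subtracts the second column from the first in each comparison; the dictionary layout and names are only for illustration:

```python
# Utilization figures copied from the two comparisons above, as
# (first column, second column) pairs.
L2_VS_NO_L2 = {
    "LUT as logic": (83345, 78580),
    "Reg. as flip-flop": (42641, 39983),
    "BRAM": (198, 62),
}
INTERNAL_VS_LITEX_8X = {
    "LUT as logic": (81831, 78580),
    "Reg. as flip-flop": (43663, 39983),
    "BRAM": (62, 62),
}


def deltas(table):
    """Return first-column minus second-column for each resource."""
    return {name: first - second for name, (first, second) in table.items()}


if __name__ == "__main__":
    # Extra resources used by adding the L2 cache (64-bit MEM port).
    print(deltas(L2_VS_NO_L2))
    # Extra resources used by widening inside Rocket vs. via LiteX.
    print(deltas(INTERNAL_VS_LITEX_8X))
```

On these numbers, the L2 cache costs 4765 LUTs, 2658 flip-flops and 136 BRAMs, while doing the 8x widening inside Rocket costs 3251 LUTs and 3680 flip-flops more than letting LiteX adapt the width.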

I'm going to have to re-assess the pros and cons (i.e., why did I "trust" or "prefer" Rocket's own internal width conversion over that provided externally by LiteX, and would it make sense to avoid tinkering with Rocket's native port width at all, thus eliminating some of the (way too many) sub-variants in litex-hub)...

EDIT: Test performed using:

litex-boards/litex_boards/targets/sitlinv_stlv7325_v2.py --build --cpu-type rocket \
    --cpu-variant full --cpu-num-cores 2 --cpu-mem-width [1|8] \
    --sys-clk-freq 75e6 --with-ethernet --with-sdcard --with-sata --sata-gen 1

commented

I built a bitstream for ecpix5 (native LiteDRAM width 128-bit, or 2x) using the 1x (64-bit wide) rocket model, with a width adapter provided by LiteX.

After loading opensbi, kernel, and initrd from sdcard (something that used to work fine with bitstream built using a 2x-wide rocket variant), it gets stuck at "liftoff" -- which IMO means the data it copied from sdcard (using DMA) got corrupted somehow. (I think, and @enjoy-digital please correct me if I'm wrong, that booting over ethernet doesn't use DMA, whereas booting from sdcard or sata does).

Either way, there's something to be said for keeping the wider mem-port variants around until we have a better understanding of what's actually happening.

I'm also going to wait for the inclusive-cache issue to get some responses, hopefully at some point soon... :)

@gsomlo
Thank you very much for the time you spent investigating and writing up your results. I will be following the issues around this :)