axis_fifo.v: odd behavior?

Question

joft-mle opened this issue 2 years ago

Hi Alex,

While working on Corundum Issue 122, we (mainly @sessl3r and @andreasbraun90) ran a quick-and-dirty simulation of axis_fifo.v using exactly the parametrization found in mqnic_egress.v and observed some apparently odd behavior:

The FIFO input signals are shown at the top of the signal list, and the output signals below them. The last word, with offset 26, does not seem to leave the FIFO as early as expected: after offset 25 has been output, the output-side ready stays high, so we would expect offset 26 to appear in the very next cycle. Instead, 26 only appears once offset 27 has been accepted (valid and ready both high) on the input side. Experiments seem to show that the output of 26 depends on the next input: delaying the input of 27 also delays the output of 26.
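The expectation above follows from the AXI4-Stream handshake rule: a beat transfers in every clock cycle in which both tvalid and tready are high, and in no other cycle. A minimal Python sketch of that rule, assuming the waveform is reduced to one (tvalid, tready, tdata) sample per cycle; the function name transferred_beats is hypothetical:

```python
def transferred_beats(cycles):
    """List the beats that transfer on an AXI4-Stream interface.

    `cycles` holds one (tvalid, tready, tdata) tuple per clock cycle; a beat
    transfers only in cycles where tvalid and tready are both high.
    Returns (cycle_index, tdata) pairs.
    """
    return [(i, data) for i, (valid, ready, data) in enumerate(cycles) if valid and ready]


# Output side with tready held high: offsets 25 and 26 leave on consecutive cycles.
output_side = [(1, 1, 25), (1, 1, 26), (0, 1, None)]
print(transferred_beats(output_side))  # [(0, 25), (1, 26)]
```

With tready held high, nothing on the output side should make offset 26 wait for a new input beat, which is why the observed dependency on offset 27 looked suspicious.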

Have we overlooked something?

It remains unclear whether this is actually related to Corundum Issue 122.

Alex Forencich · Answer 1 · Sat Oct 29 2022 02:46:41 GMT+0800 (China Standard Time)

Not sure. Post your testbench and I'll take a look. If that behavior isn't some odd testbench problem, then it could be some sort of bug in the FIFO.

David Epping · Answer 2 · Mon Oct 31 2022 20:35:47 GMT+0800 (China Standard Time)

Hi Alex,

Thanks for taking a look. A slightly modified version of the testbench is attached. I ran it with the Vivado 2022.1 simulator.
With these modifications (a few more input beats and a toggling ready at the FIFO output), beats are not only delayed but actually seem to be lost (csum_offset 26 in this case).
tb_axis_fifo_tx_checksumming_error.v.txt

Alex Forencich · Answer 3 · Tue Nov 01 2022 06:11:03 GMT+0800 (China Standard Time)

I think your testbench may have some issues. The choice between blocking and non-blocking assignments is extremely critical in testbench code; I have made the same mistake myself. What I have found works best is to use blocking assignments (=) in the clock generation and non-blocking assignments (<=) everywhere else. With that convention, it seems to simulate correctly.
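Why the assignment style matters can be illustrated with a small Python model of the race (an illustration of Verilog scheduling, not simulator code). It assumes one DUT register `q <= din` and a testbench process that drives `din` on the same clock edge; the function name captured_by_dut is hypothetical:

```python
def captured_by_dut(stimulus, tb_blocking, tb_runs_first):
    """Model a DUT register `q <= din` and a testbench that drives `din` on the same edge.

    tb_blocking=True models a stimulus written as `din = value;` (visible at once to
    any process that runs later in the same time step); False models `din <= value;`
    (update deferred until after every process has sampled its inputs).
    tb_runs_first selects which always block the simulator happens to execute first;
    Verilog leaves this order unspecified.
    Returns the value the DUT captured at each clock edge.
    """
    din = None  # value of din before the first edge
    captured = []
    for value in stimulus:
        if tb_blocking and tb_runs_first:
            din = value            # blocking write lands before the DUT samples din
            captured.append(din)
        else:
            captured.append(din)   # DUT samples the value din had before this edge
            din = value            # testbench update takes effect afterwards
    return captured


for blocking in (False, True):
    for first in (False, True):
        print(blocking, first, captured_by_dut([1, 2, 3], blocking, first))
```

Three of the four combinations print [None, 1, 2]; only a blocking stimulus combined with the testbench process happening to run first prints [1, 2, 3], i.e. the DUT sees the stimulus one cycle early. Because that process order is up to the simulator, a testbench written this way can show beats that appear shifted or lost even when the design is correct.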

David Epping · Answer 4 · Tue Nov 01 2022 15:59:13 GMT+0800 (China Standard Time)

Hi Alex,
Thank you for taking a look so quickly. You are right. We did try all combinations of RHS/LHS delays and blocking/non-blocking assignments for the clock generation, but neglected to do the same for the stimulus assignments.
The good news is that in hardware we do see the correct number of beats going into and out of the FIFO, which would have been hard to reconcile with this simulation result. We will continue to look for the cause of the data corruption we observe in hardware.
Thanks again for your time and sorry for bothering you,
David

Alex Forencich · Answer 5 · Tue Nov 01 2022 16:17:41 GMT+0800 (China Standard Time)

No problem. These FIFOs are used all over the place, so if they had a serious bug, it would probably be causing other issues as well. But precisely because they are used all over the place, any possibility of a bug needs to be properly investigated.

Hopefully we can get to the bottom of what's going on here. My other thought on this issue is that maybe something strange is going on with the memory primitives, perhaps a difference in read enable behavior or something along those lines. Do you know if there is any relation to the FIFO filling up?

David Epping · Answer 6 · Tue Nov 01 2022 16:25:33 GMT+0800 (China Standard Time)

So far we have not been able to correlate it with the fill level; we do, however, see that the input ready never deasserts. Because of the pipeline stages, it may still be possible that the actual FIFO is full at some point.
Yes, a difference in read-during-write / address conflict resolution between the Stratix 10 and MPSoC primitives is an area to look at. I have not looked at the code in detail, but could a simultaneous access to the same address happen if a full FIFO is read and written in the same clock cycle, or does your logic prevent this?

Alex Forencich · Answer 7 · Tue Nov 01 2022 16:39:28 GMT+0800 (China Standard Time)

If s_axis_tready never falls, then the FIFO is never full. The output pipeline does not affect s_axis_tready in any way; s_axis_tready is computed only from the read and write pointers.

Ostensibly, there should never be an address conflict in the FIFO. The only time the pointers point at the same location is when the FIFO is either completely full or completely empty. When the FIFO is empty, the RAM is read continuously, but the data is not actually marked as valid until the read-side logic gets the updated write pointer, one cycle after the write takes place (so the data that is read during the write is discarded). When the FIFO is full, no writes should take place until after the data has been read and the read pointer updated.
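The pointer reasoning above can be sketched as a cycle-level Python model. This assumes the common scheme in which each pointer carries one extra wrap bit, so equal pointers mean empty and pointers differing only in the wrap bit mean full; it is not a line-for-line translation of axis_fifo.v, the output pipeline register is not modeled, and the class and method names are hypothetical:

```python
class PointerFifo:
    """Cycle-level sketch of a FIFO controlled only by read and write pointers.

    Assumption: pointers are one bit wider than the RAM address (the extra bit
    tracks wraparound). A write becomes visible to the read side one clock
    edge later, so data read while the FIFO looks empty is never used.
    """

    def __init__(self, addr_width):
        self.depth = 1 << addr_width
        self.mask = 2 * self.depth - 1      # pointer range includes the wrap bit
        self.mem = [None] * self.depth
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.wr_ptr_seen = 0                # write pointer as registered by the read side

    def full(self):
        # Same RAM address, different wrap bit.
        return self.wr_ptr == (self.rd_ptr ^ self.depth)

    def empty(self):
        return self.wr_ptr_seen == self.rd_ptr

    def s_axis_tready(self):
        # Input ready is derived from the pointers only.
        return not self.full()

    def clock(self, s_tvalid, s_tdata, m_tready):
        """Advance one clock edge; return the beat transferred on the output, or None."""
        # Both decisions use the state from before the edge, as in hardware.
        do_read = m_tready and not self.empty()
        do_write = s_tvalid and not self.full()
        out = self.mem[self.rd_ptr % self.depth] if do_read else None
        self.wr_ptr_seen = self.wr_ptr      # read side picks up the old write pointer
        if do_write:
            self.mem[self.wr_ptr % self.depth] = s_tdata
            self.wr_ptr = (self.wr_ptr + 1) & self.mask
        if do_read:
            self.rd_ptr = (self.rd_ptr + 1) & self.mask
        return out


fifo = PointerFifo(addr_width=2)
beats = [25, 26, None, None, None]          # two input beats, then the input goes idle
out = [fifo.clock(d is not None, d, m_tready=True) for d in beats]
print(out)  # [None, None, 25, 26, None]
```

In this model, offset 26 leaves on the cycle right after 25 without any further input, and a write into a full FIFO is simply refused, so the read and write never target the same occupied location.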

David Epping · Answer 8 · Thu Nov 03 2022 00:52:54 GMT+0800 (China Standard Time)

Hi Alex,

Today we did some additional debugging on actual hardware (Stratix 10 MX dev kit) and were able to get closer to the root cause.

We added an optional debug shift register to axis_fifo.v that shifts in one of the input data bits on every input beat and compares it with the same bit of the current output beat. This way we can detect corruption of that bit inside the FIFO. The modified file is attached.

We enabled this debug for tx_csum_cmd_fifo on bit 16 (checksum enable).
.DBG_ENABLE(1),
.DBG_TDATA_INSPECTION_BIT(16),
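The same check can be expressed as a software scoreboard. This sketch assumes that a queue can stand in for the debug shift register (one recorded bit per beat currently held in the FIFO); it is not the code from the attached file, and the class name BitScoreboard is hypothetical:

```python
from collections import deque


class BitScoreboard:
    """Track one tdata bit through a FIFO and flag beats where it changes.

    Every accepted input beat records the inspected bit; every output beat is
    compared against the oldest recorded bit, mirroring a shift register that
    is pushed on input beats and popped on output beats.
    """

    def __init__(self, bit):
        self.bit = bit
        self.expected = deque()
        self.errors = 0

    def input_beat(self, tdata):
        self.expected.append((tdata >> self.bit) & 1)

    def output_beat(self, tdata):
        """Return True if the inspected bit matches the value recorded for this beat."""
        if not self.expected:
            raise RuntimeError("output beat without a matching input beat")
        ok = ((tdata >> self.bit) & 1) == self.expected.popleft()
        if not ok:
            self.errors += 1
        return ok


sb = BitScoreboard(bit=16)
sb.input_beat(0x00000)
print(sb.output_beat(0x12232))  # False: bit 16 is 0 on the way in but 1 on the way out
```

Tracking a single bit keeps the hardware cost to one bit per FIFO entry, at the price of missing corruption that leaves the inspected bit unchanged.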

We triggered on the assertion of dbg_tdata_error in a dual-port MQNIC set up for bridging in Linux.
As the attached screenshot shows (a SignalTap VCD export viewed in GTKWave, where 1 s = 1 clock cycle), the FIFO at this point holds one entry, briefly two entries, and then becomes completely empty.
However, the last data beat written to the FIFO (0x00000) is not the last value read from it (0x12232). This causes the incorrect TCP checksum calculation we observe.

The problem disappears when either a SignalTap probe or a syn_preserve attribute is placed on the m_axis_pipe_reg signal.
So it seems to depend on whether this register is merged into the MLAB (the netlist view shows whether or not it ends up in the MLAB).

We have not yet analyzed whether this undesired behavior arises because the HDL gives Quartus too much freedom, or whether it is a tool bug in which the output register enable signal of the MLAB is not generated correctly.
axis_fifo.v.txt

Best regards, David

Alex Forencich · Answer 9 · Thu Nov 03 2022 03:34:58 GMT+0800 (China Standard Time)

Huh, that's really interesting. I suspect that this ultimately boils down to a tool bug. Either the MLAB doesn't support the specific functionality that the HDL is asking for with respect to the pipeline register, or there is some sort of bug (silicon bug, incorrect timing data, etc.) that effectively means that the tool should not be merging the registers, or should take some other action (adjust routing to fix setup/hold time, etc.). I'm thinking maybe there is a glitch on the read enable or pipeline clock enable, and the tools aren't taking the appropriate action.

David Epping · Answer 10 · Tue Nov 08 2022 00:40:13 GMT+0800 (China Standard Time)

Just a quick update: when the FIFO is forced into M20K blocks, this error does not occur and the syn_preserve attribute is not needed. This configuration meets timing in our design (whereas syn_preserve with MLAB does not).