ikwzm / udmabuf

User space mappable dma buffer device driver for Linux.

Cache coherency?

JishinMaster opened this issue · comments

Hi,

I am trying to read from and write to a PL BRAM with a CDMA on a Zynq MPSoC device.
I want to enable cache coherency between the PS and the PL, so I am using the HPC0 port.
I have set AxCache to 0b1111 and AxProt to 0b010 and the lpd_apu (0xFF41A040) to 0x3 to enable inner and outer shareability.
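For readers following along, the two AXI sideband values above can be decoded bit by bit. This is only an illustrative sketch based on the standard AXI bit layout (AxCACHE: bufferable, modifiable, two allocate hints; AxPROT: privileged, non-secure, instruction). The function names are made up for this example and are not part of udmabuf or any Xilinx tool.

```python
def decode_axcache(value):
    """Split a 4-bit AxCACHE value into its individual attribute bits."""
    if not 0 <= value <= 0xF:
        raise ValueError("AxCACHE is a 4-bit field")
    return {
        "bufferable": bool(value & 0b0001),
        "modifiable": bool(value & 0b0010),  # named "cacheable" in AXI3
        "allocate_hint_bit2": bool(value & 0b0100),
        "allocate_hint_bit3": bool(value & 0b1000),
    }


def decode_axprot(value):
    """Split a 3-bit AxPROT value into its individual attribute bits."""
    if not 0 <= value <= 0x7:
        raise ValueError("AxPROT is a 3-bit field")
    return {
        "privileged": bool(value & 0b001),
        "non_secure": bool(value & 0b010),
        "instruction": bool(value & 0b100),
    }
```

With the values from this post, decode_axcache(0b1111) reports every attribute set, and decode_axprot(0b010) reports an unprivileged, non-secure data access.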

Using udmabuf with SYNC_MODE_WRITECOMBINE, I get the same performance as without cache coherency: bandwidth to/from the BRAM seems okay, but PS-to-PS copies within the allocated region are slow because the region is not cacheable.

When I allocate the memory with mode 0 (i.e. neither noncached, writecombine, nor dmacoherent), I get very good PS-to-PS performance, the same as with malloc, but PS to/from PL performance plummets.

Is there a proper way to enable cache coherency (and snooping) with the udmabuf driver?

Thank you for your help.

Thank you for the issue.

I have some questions.

Q1: Is the O_SYNC flag set when you open udmabuf, or not?

> Using udmabuf with SYNC_MODE_WRITECOMBINE, I get the same performance as without cache coherency: bandwidth to/from the BRAM seems okay, but PS-to-PS copies within the allocated region are slow because the region is not cacheable.

Q2: Was the "bandwidth to/from BRAM" measured on transfers performed by the CDMA?

> When I allocate the memory with mode 0 (i.e. neither noncached, writecombine, nor dmacoherent), I get very good PS-to-PS performance, the same as with malloc, but PS to/from PL performance plummets.

Q3: Was "PS to/from PL" also measured on transfers performed by the CDMA?

A1:
Setting /sys/class/udmabuf/udmabuf0/sync_mode to 0 and opening /dev/udmabuf0 with O_SYNC gives the same behavior as leaving /sys/class/udmabuf/udmabuf0/sync_mode at its default value of 1 and opening without the O_SYNC flag. Setting sync_mode to 0 without O_SYNC also gives the same performance.
By default, dma-coherent was 0 (I loaded the driver with insmod udmabuf.ko udmabuf0=1048576 udmabuf1=1048576). I then modified the device tree to set dma-coherent to 1, which did not change anything.
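The combinations described in A1 (a sync_mode value, with or without O_SYNC) can be reproduced from user space roughly as follows. This is a minimal sketch, not the driver's official example: the helper names are hypothetical, and it assumes the sysfs directory exposes the size, sync_mode and dma_coherent attributes as plain integers.

```python
import mmap
import os


def read_attr(sysfs_dir, name):
    """Read one sysfs attribute as an integer (decimal or 0x-prefixed hex)."""
    with open(os.path.join(sysfs_dir, name)) as f:
        return int(f.read().strip(), 0)


def write_attr(sysfs_dir, name, value):
    """Write an integer to one sysfs attribute."""
    with open(os.path.join(sysfs_dir, name), "w") as f:
        f.write(str(value))


def map_udmabuf(dev_path, sysfs_dir, sync_mode=None, o_sync=False):
    """Optionally set sync_mode, then open the device and mmap the whole buffer."""
    if sync_mode is not None:
        write_attr(sysfs_dir, "sync_mode", sync_mode)
    size = read_attr(sysfs_dir, "size")
    flags = os.O_RDWR | (os.O_SYNC if o_sync else 0)
    fd = os.open(dev_path, flags)
    try:
        buf = mmap.mmap(fd, size, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    return buf
```

On the setup in this thread, a call would look like map_udmabuf("/dev/udmabuf0", "/sys/class/udmabuf/udmabuf0", sync_mode=0, o_sync=True), and read_attr("/sys/class/udmabuf/udmabuf0", "dma_coherent") would show whether the device tree change took effect.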

A2/A3: I have two BRAM controllers in my design, one per port of my dual-port BRAM. The first one goes to the CPU, the second one goes to the CDMA. I did this so I could check the values in the BRAM after the CDMA transfer is complete. So yes, the data is transferred in both directions by the CDMA, and it is valid.

Cache coherency (and snooping) seems to be working correctly. What's wrong?

There seems to be some kind of coherency enabled, which is great, but performance is worse than without coherency.
Isn't cache coherency supposed to give better performance?
The copied sizes are small enough to fit in the cache; with snooping enabled, the PL should be able to fetch the cache lines through the CCI without going to DDR, and hence be faster.
Yet that is not what I observe.

Have you seen such behavior on the boards you are working on?

In my experience, a cache-coherent transfer is slightly delayed because a snoop operation is inserted.

Especially on ZynqMP, a cache-coherent transfer takes a long path: AXI-HPx => CCI => SCU => L2 cache (or L1 cache). A transfer without cache coherency takes a short path: AXI-HPx => CCI => SDRAM controller.

In addition, on Zynq, experiments have shown that a cache-coherent transfer through Zynq's ACP (Accelerator Coherency Port) is several tens of percent slower than a non-coherent transfer through an HP (High Performance) port.

A transfer with cache coherency is slower than one without, but the system is faster overall because there is no need to flush or invalidate the cache.
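To make the flush/invalidate cost concrete: when the buffer is mapped cacheable and the hardware is not coherent, user space has to ask the driver for cache maintenance around every DMA transfer. The sketch below shows that manual step using udmabuf's sysfs sync attributes (sync_offset, sync_size, sync_direction, sync_for_cpu, sync_for_device). The function names are hypothetical, and writing the values as decimal strings is an assumption of this example.

```python
import os

# Values accepted by the sync_direction attribute.
DMA_BIDIRECTIONAL = 0
DMA_TO_DEVICE = 1
DMA_FROM_DEVICE = 2


def _write_attr(sysfs_dir, name, value):
    """Write an integer to one sysfs attribute."""
    with open(os.path.join(sysfs_dir, name), "w") as f:
        f.write(str(value))


def _set_sync_range(sysfs_dir, offset, size, direction):
    """Select the buffer range and DMA direction for the next sync request."""
    _write_attr(sysfs_dir, "sync_offset", offset)
    _write_attr(sysfs_dir, "sync_size", size)
    _write_attr(sysfs_dir, "sync_direction", direction)


def sync_for_device(sysfs_dir, offset, size, direction=DMA_TO_DEVICE):
    """Call before starting DMA so CPU writes in the range reach memory."""
    _set_sync_range(sysfs_dir, offset, size, direction)
    _write_attr(sysfs_dir, "sync_for_device", 1)


def sync_for_cpu(sysfs_dir, offset, size, direction=DMA_FROM_DEVICE):
    """Call after DMA completes so the CPU does not read stale cache lines."""
    _set_sync_range(sysfs_dir, offset, size, direction)
    _write_attr(sysfs_dir, "sync_for_cpu", 1)
```

With working hardware coherency, neither call should be needed, which is where the overall gain comes from even though each individual transfer is slower.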

Thanks a lot for your support, it really helps my understanding.

@JishinMaster
Did you successfully set up udmabuf to work with the CDMA?

Did you associate the udmabuf with the BRAM's memory region?

I currently have two udmabufs, one for the TX direction and the other for the RX direction. I write the source udmabuf's physical address to my CDMA's source address register, and likewise the destination udmabuf's physical address to the destination address register. As a result, my system halts after I set the number of bytes to transfer.

Do I need to use the BRAM's memory region for my udmabuf in the TX direction?
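For anyone comparing notes, the register sequence described above (source address, destination address, then byte count) can be sketched as a list of writes. This assumes the Xilinx AXI CDMA in simple (non scatter-gather) mode with its usual register offsets; the *_MSB registers only exist when the core is configured for addresses wider than 32 bits. The function name is hypothetical, and the sketch does not diagnose the halt reported here.

```python
# Register offsets of the AXI CDMA in simple (non scatter-gather) mode.
CDMA_SA = 0x18      # source address, lower 32 bits
CDMA_SA_MSB = 0x1C  # source address, upper 32 bits
CDMA_DA = 0x20      # destination address, lower 32 bits
CDMA_DA_MSB = 0x24  # destination address, upper 32 bits
CDMA_BTT = 0x28     # bytes to transfer; writing it starts the transfer


def cdma_simple_transfer_writes(src_phys, dst_phys, nbytes):
    """Return the ordered (offset, 32-bit value) writes for one transfer."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    mask = 0xFFFFFFFF
    return [
        (CDMA_SA, src_phys & mask),
        (CDMA_SA_MSB, (src_phys >> 32) & mask),
        (CDMA_DA, dst_phys & mask),
        (CDMA_DA_MSB, (dst_phys >> 32) & mask),
        (CDMA_BTT, nbytes),  # must be last: this write kicks off the DMA
    ]
```

The physical addresses would come from each buffer's phys_addr attribute in sysfs. Both addresses must be reachable from the CDMA's AXI master port, since the engine begins issuing bus transactions as soon as the byte count is written.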