ikwzm / udmabuf

User space mappable dma buffer device driver for Linux.

[Question] Are memory barriers necessary when utilizing manual cache management from C/C++ code?

dawithers opened this issue

This whitepaper has a lot of good information for anyone deciding how to move data back and forth between the CPU and the PL. One of the things it mentions is the need for memory fences/barriers if you are going to control cache coherency from SW rather than HW.

"In Linux, after each buffer is flushed or invalidated, global memory barrier should be inserted to guarantee no memory accesses are reordered. "

Is there any reason this shouldn't be a concern when using manual cache control with udmabuf? In other words, when reading a PL-to-CPU DMA transfer with manual cache control, is it possible for the compiler to reorder the code so that the read is performed before the invalidate command reaches the cache?

In the whitepaper they give this as basically the deciding factor for ditching the HP AXI ports with SW coherency and instead using the HPC AXI ports with HW coherency through the CCI. They say that HP is faster until you put in the necessary memory barriers.

When manual cache control is performed with u-dma-buf, the memory fences/barriers are automatically performed in the Linux Kernel.

u-dma-buf flushes or invalidates the cache by calling dma_sync_single_for_device() or dma_sync_single_for_cpu().
These functions are APIs of the dma-mapping framework provided by the Linux Kernel.
These functions control the cache depending on the CPU architecture, and the memory fences/barriers are included there as well.
For example, the arm64 cache control is in arch/arm64/mm/cache.S. From that code, you can see that a memory barrier instruction is executed after the cache maintenance instructions.
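For illustration, here is a minimal user-space sketch of that sequence, not taken from the driver's documentation: the device name udmabuf0, the sysfs path, and the buffer size are examples and depend on how the driver instance was created. Writing "1" to sync_for_cpu asks the driver to call dma_sync_single_for_cpu(), so by the time write() returns, the invalidate and its barrier have already run in the kernel.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)  /* example size; must not exceed the udmabuf buffer size */

int main(void)
{
    /* Map the u-dma-buf buffer into user space (device name is an example). */
    int dev_fd = open("/dev/udmabuf0", O_RDWR);
    if (dev_fd < 0) { perror("open /dev/udmabuf0"); return 1; }

    uint8_t *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, dev_fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Invalidate the CPU cache for this buffer before reading data written by the PL.
     * The sysfs path is an example and may differ by driver version/instance. */
    int sync_fd = open("/sys/class/u-dma-buf/udmabuf0/sync_for_cpu", O_WRONLY);
    if (sync_fd < 0) { perror("open sync_for_cpu"); return 1; }
    if (write(sync_fd, "1", 1) != 1) { perror("write sync_for_cpu"); return 1; }
    close(sync_fd);

    /* By the time write() has returned, dma_sync_single_for_cpu() and its
     * barrier have completed in the kernel, so this read sees the DMA data. */
    printf("first byte after sync_for_cpu: 0x%02x\n", buf[0]);

    munmap(buf, BUF_SIZE);
    close(dev_fd);
    return 0;
}
```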

Thanks for your quick response as well as your work on this driver. If I have compiled C++ code that follows your example of opening the sync_for_cpu file and writing a "1", immediately followed by a read of the given buffer, there's no way the compiler can reorder that read to happen before the sync_for_cpu completes?

If you're worried, it's a good idea to perform the write to sync_for_cpu or sync_for_device in sync mode.
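A minimal sketch of that suggestion, assuming "sync mode" here means opening the sysfs attribute with O_SYNC so that the write returns only after the driver's store handler (and its cache maintenance) has run; the paths and device name are again examples:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write "1" to a u-dma-buf sync attribute, with the
 * file opened O_SYNC. The sysfs path depends on the driver version/instance. */
static int udmabuf_sync(const char *attr_path)
{
    int fd = open(attr_path, O_WRONLY | O_SYNC);
    if (fd < 0) { perror(attr_path); return -1; }
    ssize_t n = write(fd, "1", 1);
    close(fd);
    return (n == 1) ? 0 : -1;
}

int main(void)
{
    /* Before the CPU reads data produced by the PL: invalidate. */
    if (udmabuf_sync("/sys/class/u-dma-buf/udmabuf0/sync_for_cpu") < 0) return 1;

    /* ... read the mmapped buffer here ... */

    /* Before the PL reads data produced by the CPU: flush. */
    if (udmabuf_sync("/sys/class/u-dma-buf/udmabuf0/sync_for_device") < 0) return 1;
    return 0;
}
```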

Thanks