ikwzm / udmabuf

User space mappable dma buffer device driver for Linux.

Could this be used with V4L2/libcamera buffers on the Raspberry Pi 4 (Arm A72)

octopus-russell opened this issue · comments

Hi,
We've come across this driver as a potential way of passing a user-space DMA buffer to V4L2 instead of using V4L2's default mmap mode, which is rather slow. I see someone has already done this and achieved a 15x speedup: #38
Do you know if this module supports the Raspberry Pi 4? (ARM A72, Debian bullseye, kernel 6.1.21)
Thanks
Russell

Thanks for the issue.

I have only run it on ARM Cortex®-A53 (Xilinx Zynq Ultrascale+ MPSoC) and ARM Cortex®-A9 (Xilinx ZYNQ / Altera CycloneV SoC), so I don't know whether it works on ARM Cortex®-A72 (Raspberry Pi 4).

It may work on the ARM Cortex®-A72 (Raspberry Pi 4) since it has the same arm64 architecture as the A53.

If anyone has tried it on that platform, please share what you found.

udmabuf is probably not a good way to pass the buffers. However, if accessing mmap'ed buffers is slow for you, it is most likely because they are mapped as uncached memory.

Here is a little explanation about the cache being turned off.

Performance issue with V4L2 streaming I/O (V4L2_MEMORY_MMAP)

Introduction

V4L2 streaming I/O (V4L2_MEMORY_MMAP) is a V4L2 streaming I/O scheme in which V4L2 buffers allocated by the V4L2 driver (in the kernel) are mapped to user space using the mmap mechanism. It is used relatively often because it gives user programs direct access to the V4L2 buffers.

However, certain V4L2 drivers have a problem: caching is turned off when their buffers are mapped to user space with mmap, which makes memory access very slow and hurts performance.
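One rough way to check whether a mapped buffer is affected is to time reads from it and compare the result with an ordinary malloc'ed buffer of the same size. The sketch below is only an illustration: the function name read_throughput_mib_s() and the use of memcpy as the read workload are assumptions of this example, and the inline-asm barrier is specific to GCC and Clang.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Copy len bytes out of src `rounds` times and return the read rate in
 * MiB/s, or -1.0 if the scratch buffer cannot be allocated.
 */
double read_throughput_mib_s(const void *src, size_t len, int rounds)
{
	struct timespec t0, t1;
	void *dst;
	double secs;

	assert(src && len > 0 && rounds > 0);

	dst = malloc(len);
	if (!dst)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < rounds; i++) {
		memcpy(dst, src, len);
		/* GCC/Clang barrier: keeps the copy from being optimized away. */
		__asm__ volatile("" : : "r"(dst) : "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	free(dst);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		secs = 1e-9; /* guard against a zero reading on a coarse clock */

	return ((double)len * rounds) / (1024.0 * 1024.0) / secs;
}
```

Calling it once on a buffer obtained from the V4L2 mmap and once on a malloc'ed buffer of the same length gives two numbers to compare. A much lower rate for the mapped buffer suggests that it is not cached.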

One V4L2 driver that causes this problem is Xilinx's Video DMA.

This topic describes the mechanism.

Mechanism of cache turn-off

The mmap implementation of dma-contig, one of the V4L2 buffer memory allocators, turns off the cache in some cases.
As a result, any V4L2 driver that uses dma-contig can hand out uncached mappings from its mmap.

Memory allocator for V4L2 buffer

There are three types of memory allocators for V4L2 buffers:

  • vmalloc : for V4L2 drivers without DMA
  • dma-sg : for DMA devices supporting Scatter Gather
  • dma-contig : for DMA devices that do not support Scatter Gather

Of these, the last one, dma-contig, is the most problematic.

vmalloc

vmalloc is a memory allocator for V4L2 drivers without DMA.
The V4L2 driver for USB cameras is one example. With USB, the USB device driver transfers data to and from the USB device, and the V4L2 driver itself does not transfer data directly.
Therefore, it allocates memory using vmalloc, which is normally used by the kernel.

dma-sg

dma-sg is a memory allocator for devices with DMA supporting Scatter Gather, which allows DMA transfers even when buffers are not contiguous in physical memory space.
It allocates memory using the Linux kernel's dma_sg API.

dma-contig

dma-contig is a memory allocator for devices whose DMA does not support Scatter Gather. It uses the Linux kernel's DMA API to allocate memory.
The mmap of dma-contig is where the problem lies: the mmap of a V4L2 driver that uses dma-contig may turn off the cache.

mmap for dma-contig

vb2_dc_mmap()

The mmap for dma-contig is as follows:

https://elixir.bootlin.com/linux/v6.1.38/source/drivers/media/common/videobuf2/videobuf2-dma-contig.c#L274

static int vb2_dc_mmap(void *buf_priv, struct vm_area_struct *vma)
{
	struct vb2_dc_buf *buf = buf_priv;
	int ret;

	if (!buf) {
		printk(KERN_ERR "No buffer to map\n");
		return -EINVAL;
	}

	if (buf->non_coherent_mem)
		ret = dma_mmap_noncontiguous(buf->dev, vma, buf->size,
					     buf->dma_sgt);
	else
		ret = dma_mmap_attrs(buf->dev, vma, buf->cookie, buf->dma_addr,
				     buf->size, buf->attrs);
	if (ret) {
		pr_err("Remapping memory failed, error: %d\n", ret);
		return ret;
	}

	vma->vm_flags		|= VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_private_data	= &buf->handler;
	vma->vm_ops		= &vb2_common_vm_ops;

	vma->vm_ops->open(vma);

	pr_debug("%s: mapped dma addr 0x%08lx at 0x%08lx, size %lu\n",
		 __func__, (unsigned long)buf->dma_addr, vma->vm_start,
		 buf->size);

	return 0;
}

The buf->non_coherent_mem case is not considered here.
If buf->non_coherent_mem is true, the buffer is allocated in non-contiguous space.
Otherwise the buffer is allocated in contiguous space, and dma_mmap_attrs() is called.

dma_mmap_attrs()

dma_mmap_attrs() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/mapping.c#L457

int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
		void *cpu_addr, dma_addr_t dma_addr, size_t size,
		unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_alloc_direct(dev, ops))
		return dma_direct_mmap(dev, vma, cpu_addr, dma_addr, size,
				attrs);
	if (!ops->mmap)
		return -ENXIO;
	return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
}

On the arm64 architecture, dma_alloc_direct() is normally TRUE, so dma_direct_mmap() is called.

dma_direct_mmap()

dma_direct_mmap() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/direct.c#L555

int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,
		void *cpu_addr, dma_addr_t dma_addr, size_t size,
		unsigned long attrs)
{
	unsigned long user_count = vma_pages(vma);
	unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
	unsigned long pfn = PHYS_PFN(dma_to_phys(dev, dma_addr));
	int ret = -ENXIO;

	vma->vm_page_prot = dma_pgprot(dev, vma->vm_page_prot, attrs);
	if (force_dma_unencrypted(dev))
		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

	if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, &ret))
		return ret;
	if (dma_mmap_from_global_coherent(vma, cpu_addr, size, &ret))
		return ret;

	if (vma->vm_pgoff >= count || user_count > count - vma->vm_pgoff)
		return -ENXIO;
	return remap_pfn_range(vma, vma->vm_start, pfn + vma->vm_pgoff,
			user_count << PAGE_SHIFT, vma->vm_page_prot);
}

Here, the cache attributes of the mapping (vma->vm_page_prot) are set by dma_pgprot().

dma_pgprot()

dma_pgprot() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/mapping.c#L415

#ifdef CONFIG_MMU
/*
 * Return the page attributes used for mapping dma_alloc_* memory, either in
 * kernel space if remapping is needed, or to userspace through dma_mmap_*.
 */
pgprot_t dma_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs)
{
	if (dev_is_dma_coherent(dev))
		return prot;
#ifdef CONFIG_ARCH_HAS_DMA_WRITE_COMBINE
	if (attrs & DMA_ATTR_WRITE_COMBINE)
		return pgprot_writecombine(prot);
#endif
	return pgprot_dmacoherent(prot);
}
#endif /* CONFIG_MMU */

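Note that dev_is_dma_coherent() is checked first.
If dev_is_dma_coherent() is true, dma_pgprot() returns prot unchanged, so the mapping keeps whatever attributes it already had.
If dev_is_dma_coherent() is false, dma_pgprot() returns the result of pgprot_dmacoherent(), unless DMA_ATTR_WRITE_COMBINE is set on an architecture with CONFIG_ARCH_HAS_DMA_WRITE_COMBINE, in which case it returns the result of pgprot_writecombine().
pgprot_dmacoherent() returns an architecture-dependent value.
On arm64, pgprot_dmacoherent() returns the same value as pgprot_writecombine(), which maps the memory as non-cacheable.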
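The branch order of dma_pgprot() can be summarized in a small user-space model. This is not kernel code: the enum, the function names, and the boolean parameters are hypothetical stand-ins for dev_is_dma_coherent(), CONFIG_ARCH_HAS_DMA_WRITE_COMBINE and DMA_ATTR_WRITE_COMBINE, and the model assumes that an unchanged prot means a normal cached mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Which page-protection path dma_pgprot() takes (hypothetical names). */
enum pgprot_mode {
	MODE_UNCHANGED,     /* prot returned as-is */
	MODE_WRITE_COMBINE, /* pgprot_writecombine(prot) */
	MODE_DMA_COHERENT   /* pgprot_dmacoherent(prot) */
};

/* Follows the same order of checks as the dma_pgprot() source quoted above. */
enum pgprot_mode model_dma_pgprot(bool dev_coherent, bool arch_has_wc,
				  bool attr_write_combine)
{
	if (dev_coherent)
		return MODE_UNCHANGED;
	if (arch_has_wc && attr_write_combine)
		return MODE_WRITE_COMBINE;
	return MODE_DMA_COHERENT;
}

/*
 * On arm64, pgprot_dmacoherent() and pgprot_writecombine() yield the same
 * non-cacheable attributes, so only the unchanged path stays cached.
 * Assumption of this model: the unchanged prot is a cached mapping.
 */
bool model_arm64_is_cached(enum pgprot_mode mode)
{
	return mode == MODE_UNCHANGED;
}
```

In this model, model_arm64_is_cached(model_dma_pgprot(false, true, false)) is false, which matches the case of a non-coherent device using dma-contig on arm64.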

Conclusion

On the arm64 architecture, a V4L2 driver that uses dma-contig turns off the cache on mmap whenever dev_is_dma_coherent() is false for its device.