raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Performance regression when using vcsm-cma instead of vcsm, with zero copy MMAL

malc0 opened this issue · comments

Describe the bug

When using MMAL_PARAMETER_ZERO_COPY in combination with vcsm-cma, using the CPU to read from frames captured by a camera in MMAL_ENCODING_I420 format slows down significantly -- taking over 30x longer in a particular benchmark, compared to using the previous vcsm implementation.

In addition to the greater time taken, the CPU usage reported by top is higher (~20% instead of ~2%). Removing the calls that set MMAL_PARAMETER_ZERO_COPY restores the faster timing, but CPU usage remains high.

Steps to reproduce the behaviour

The problem can be seen by applying the patch at https://gist.github.com/malc0/0a9ee21fd92ecc1e37a18fa6507b069e to RaspiVid.c from the current (54fd97ae4066a10b6b02089bc769ceed328737e0) userland repository, then running the resulting raspivid binary with '-o foo.h264' and comparing its output. With the final 5.4 kernel from the firmware repository (8cd76653b88939baf25c3f9d9ce90657bcc19b76), the time to read the first 20 rows of a 1920-pixel-wide image is ~200 microseconds. Renaming /dev/vcsm to something else forces the vcsm-cma mechanism to be used, and the same read typically takes more than 6000 microseconds. The most recent 5.15.x firmware commit (494eb71e5adfca31ec65dd535fce73de3c7c2efa) shows similar times for vcsm-cma, but a comparison with vcsm is no longer possible there.

Device(s)

Raspberry Pi 3 Model B+

System

OS: Raspbian GNU/Linux 11, dist-upgraded from jessie originally installed using https://github.com/debian-pi/raspbian-ua-netinst

vcgencmd version: Dec 12 2022 12:00:07
Copyright (c) 2012 Broadcom
version ed6f6b8fcdc6476410b9cf75d141633461d34bdd (clean) (release) (start_x)

uname -a: Linux localhost 5.15.83-v7+ #1607 SMP Thu Dec 15 12:55:05 GMT 2022 armv7l GNU/Linux

Logs

No response

Additional context

cmdline.txt:
dwc_otg.lpm_enable=0 dwc_otg.fiq_enable=0 dwc_otg.fiq_fsm_enable=0 console=ttyAMA0,115200 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait

config.txt:
start_x=1
gpu_mem=256
dtoverlay=cma,cma-96
dtoverlay=disable-bt

commented

vcsm supported cached memory on the ARM side, which vcsm-cma does not.
Accessing buffers as individual pixels will therefore be slower, as each access requires a round trip to SDRAM, whilst longer bursts will perform reasonably.

Adding support for ARM-side caching to vcsm-cma is trickier, as it is based on the DMA APIs for allocation, and those tend to assume non-cached mappings because the buffers will be accessed by hardware.

Thanks for the information. It seems that first doing a memcpy out of buffer->data into a temporary buffer, before accessing the pixel values, is significantly faster -- though still not as quick as vcsm, or as not setting ZERO_COPY in the first place.

Is the non-ZERO_COPY path simply avoiding vcsm-cma, or is it using some more efficient method to copy blocks of data than memcpy?

commented

Non zero-copy uses DMA to copy from a buffer in gpu_mem to a buffer in ARM memory.
vcsm would map the gpu_mem buffer into ARM memory, but it was a huge mess, couldn't import ARM memory (no dmabuf support), and had to be incredibly careful over cache management.
vcsm-cma allocates from the CMA (Contiguous Memory Allocator) heap, and then maps the buffer into the gpu. It uses dmabufs throughout.

There is /dev/dma_heap/linux,cma that will allocate ARM cached CMA buffers, and those can be imported into vcsm-cma and MMAL via vcsm_import_dmabuf. I did look at reworking the userland vcsm library to use the dma_heap, but it was a low priority.

commented

For examples:

Many thanks for the hints/examples. The straightforward integration of /dev/dma_heap/linux,cma in https://gist.github.com/malc0/f58c63276d7202a21487dfdf667819f0 seems to work for my test-case (after tweaking the /dev/dma_heap permissions), and goes a long way to restoring performance.

Two further questions:

  1. My memory-access benchmark is now in fact faster than with the old vcsm mechanism, but CPU usage is roughly doubled (~3% with vcsm vs ~7% with patched vcsm-cma). This is entirely tolerable, but is it expected?
  2. I've not seen any disadvantage to closing the dma_heap FDs as in the patch, but this approach isn't taken in the examples you gave; ought I to keep them open?