Slow framebuffer copy from GPU memory to SPI bus in fbcp hurts Adafruit's PiGRRL Retropie project

Question

Slow framebuffer copy from GPU memory to SPI bus in fbcp hurts Adafruit's PiGRRL Retropie project

juj opened this issue 7 years ago · comments

Hi all,

Poking around a bit with a Retropie DIY handheld project that was made hugely popular by Adafruit in their PiGRRL series of Raspberry Pi projects. It looks like the display controller, described at https://learn.adafruit.com/running-opengl-based-games-and-emulators-on-adafruit-pitft-displays/pitft-setup, utilizes this fbcp code. Unfortunately this causes some performance and stuttering issues, that are described for example in this Waveshare 3.5" Raspberry Pi Screen Review video from RetroManCave.

Doing some napkin math, theoretical SPI transfer rate is mentioned as 125MHz for the Pi, but the BCM2835 document states that "the practical SPI clock will be lower as the I/O pads can not transmit or receive signals at such hig speed". The Raspberry Pi SPI document also lists the same, and the next divisor after that is 62.5MHz. Booting the Pi on a PiTFT 2.8" display in that data rate mode seems to work, so that would give a theoretical ceiling of 62.5MHz / 320 / 240 / 16 = 50.86 frames per second. The current implementation does not seem to get even close to that.

In a later video, the author migrated to using HDMI, which is described in this Elecrow 5 Inch LCD Review RetroPie & Raspberry Pi video. It's sweet that using a HDMI-based device will work great, but such a shame that Adafruit articles have made these SPI-based displays so popular, so one wonders if there would be something that could be done to improve the performance of SPI-based displays as well.

Peeking into the framebuffer copy code in this repository, it looks like this

while (1) {
    ret = vc_dispmanx_snapshot(display, screen_resource, 0);
    vc_dispmanx_resource_read_data(screen_resource, &rect1, fbp, vinfo.xres * vinfo.bits_per_pixel / 8);
    usleep(25 * 1000);
}

is the culprit. Someone already experimented with just removing the sleep in issue #12, but that does not seem to be the root problem here. I presume the sleep was added to throttle the updates to 50Hz to match to a refresh rate that a display used during development was using, or just to give the CPU a bit of a breather, or due to a lack of frame ready signaling in this loop to be able to throttle at least in some manner.

I have some experience with writing driver code for an SPI-based display on Arduino, but no much previous exposure to Raspberry Pi, so this research has been a bit of a poking around various resources type of activity. Thinking about if this kind of "push display data from VideoCore to a SPI-based display" output method could be optimized, I was searching through different resources, so writing down notes here to centralize a bit.

\1. The above loop has an issue that it does not know when a new rendered frame has actually been finished (ready for presentation) by the GPU. Given that the GPU will likely produce frames with varying rates in a push manner, depending on the workload, if there was a way to wait for the GPU to finish a frame, or get a signal when it does, that would allow removing the sleep. Something like:

while (1) {
    ret = vc_dispmanx_snapshot(display, screen_resource, 0);
    vc_dispmanx_resource_read_data(screen_resource, &rect1, fbp, vinfo.xres * vinfo.bits_per_pixel / 8);
    sleep_to_wait_for_videcore_gpu_to_finish_rendering_its_next_frame();
}

might be an improvement. I was not able to find resources to know whether such a functionality would be exposed by the driver though. There does seem to exist a function vc_dispmanx_vsync_callback, but I understand that would be tied to a connected HDMI display, and would not play a role here if there is no HDMI display attached. Source code found here:

\2. Reading through ARM Broadcom BCM2835 Peripherals manual, there's a mention of an option of the GPU memory space being mapped to the CPU address space, Page 6:

The VideoCore section of the RAM is mapped in only if the system is configured to 
support a memory mapped display (this is the common case).

If that is the case for Raspberry Pi that it'd be possible to directly address the framebuffer bytes, then the above code might be optimizable by removing the need to "snapshot" a frame after it has been rendered, and directly access the portion of the GPU memory that it has drawn the image to. This would require interaction with double-buffering of the GPU, so that there would exist two framebuffer areas in the GPU memory space, one that the GPU would be rendering to, and another that would be the previous finished frame, that could be used for a read operation by fbcp to copy the data out to the SPI bus. If this was the case, then the code might theoretically look something like:

const char *gpu_framebuffer_memory_1 = 0xF2123456; // Whatever the actual address
const char *gpu_framebuffer_memory_2 = 0xF2789ABC;
while (1) {
    vc_dispmanx_resource_read_data(gpu_framebuffer_memory_1, &rect1, fbp, vinfo.xres * vinfo.bits_per_pixel / 8);
    const char *tmp = gpu_framebuffer_memory_1;
    gpu_framebuffer_memory_1 = gpu_framebuffer_memory_2;
    gpu_framebuffer_memory_2 = tmp;
    sleep_to_wait_for_videcore_gpu_to_finish_rendering_its_next_frame();
}

This would avoid a need to pull down a copy of the frame to be submit to the SPI bus. In this kind of mode though, there would need to be a way to "reserve" or "lock" the access to that framebuffer memory so that the GPU will not start rendering to it while it's still being copied out.

\3. I'm unsure what the mmap portion of the code for the framebuffer ends up behaving like. The fbtft code seems to be what contains the actual code to write out to the SPI bus. The PiTFT from Adafruit is an ILI9341 controlled display, and tracing the code makes it look like it goes down through this code to write the framebuffer out to the SPI. The function spi_sync does seem to be a general implementation that takes into account a lot of different cases to be robust. This makes one wonder whether directly accessing the SPI GPIO pins in the above update loop and bypassing the whole kernel would be possible, and how that would perform in comparison.

\4. However if done manually, dedicating the Pi CPU to bit bang through the SPI GPIO pins would be wasteful (and probably not ideal, even if it was faster). The same BCM2835 PDF document lists there being a dedicated SPI peripheral that can take in full bytes to write, and it manages writing the bits out to the SPI bus. Reading even further, feeding individual bytes to this SPI peripheral using the CPU would also probably be wasteful, since it looks like the DMA unit on the Pi is compatible with the SPI peripheral, and capable of driving a transfer of bytes out the SPI bus via DMA writes. Reading through @notro's excellent work at https://github.com/notro/fbtft/wiki, this may probably be what's already happening with the currently framebuffer mmap path?

\5. The problem with this existing general SPI-based write approach is that it models a generic SPI bus that is not aware of the display's vsync and hsync signals. This results in observable tearing, and to optimize, one would need to do the DMA writes in a way that is aware of the vsync refresh window of the display. The pin diagram of the ILI9340 display shows that there do exist dedicated GPIO pins that signal vsync and hsync intervals, so in theory it would be possible to do the DMA writes at times that would align to vsync, but this would definitely require bypassing the kernel's DMA-SPI stack altogether. This is assuming that the DMA chip of the Pi is reasonably fast, ideally as fast or faster than the CPU of the Pi itself. This is also assuming that the DMA chip would have the capability to address the GPU framebuffer memory directly (the CPU being able to address the GPU memory directly might not imply that the DMA chip can - is that the case?)

My understanding is that it would be possible to write a kernel side interrupt to be raised at each edge of the hsync/vsync GPIO pin changes state, to detect hsync increments and vsync enter/exit? In that case, it would be possible to follow the display update beam to time the DMA updates to follow vsync.

\6. Putting all bits together, I wonder if it would be possible to write a kernel module that did a tight DMA-based fast path from VideoCore GPU memory out to SPI by directly accessing the DMA chip itself, and if so, if such an approach could be compatible with other kernel modules? E.g. do there exist other kernel modules that assume an exclusive ownership to access the DMA chip that one could not just talk to it directly at low level from one's own kernel module code? Thinking something like the following, in very high level pseudo:

volatile int gpu_frame_state = 0; // If 1, there is a new rendered GPU frame available that needs pushing out to SPI. If 2, the transfer is in progress.
volatile const char *ptr_to_ready_frame = 0; // ptr to the GPU memory buffer that is pending a write out to SPI bus

volatile int currentScanline = 0; // Current horizontal scanline that the display beam is at

void hsyncCallback() { // Interrupt called on each hsync increment, could also be used to "chase the beam" for more optimized update
	++currentScanline;
}

void vsyncCallback() { // Interrupt called on each vsync blank
	currentScanline = 0;
	if (gpu_frame_state == 1) {
		gpu_frame_state = 2;
		start_dma_write(ptr_to_ready_frame, sizeOfDisplayBytes);
	}
}

void *dmaWritePtr;
void *dmaWriteEndPtr;

void continue_dma() { // Writes out the next 64K block of DMA data
	if (dmaWritePtr < dmaWriteEndPtr) { // Still data to write?
		dma_write(dmaWritePtr, 65536, on_finished=&continue_dma);
		dmaWritePtr += 65536;
	} else { // Finished writing out the whole framebuffer
		gpu_frame_state = 0; // No pending frame to write
		videocore_release_gpu_framebuffer(ptr_to_ready_frame); // Tell the GPU it now has free access to start rendering the next frame to this memory
	}
}

void start_dma_write(const char *ptr, size_t length) {
	dmaWritePtr = ptr;
	dmaWriteEndPtr = ptr+length;
	continue_dma();
}

void on_gpu_has_signaled_frame_ready(const char *ptr) { // Interrupt/callback from GPU whenever it has produced a new frame for display
	// The assumption is that the GPU won't access 'ptr' until it is released back to it
	gpu_frame_state = 1;
	ptr_to_ready_frame = ptr;
}

The above code is quite utopistic in the sense of assuming a direct access to all the chips, and I wonder if such capabilities would exist(?)

Eventually I did migrate to using an external HDMI display to not have to deal with GPU->SPI trouble, though I was left wondering whether the above functionality would be possible to implement to the Pi. Especially since Adafruit has had the (unfortunate?) effect of making SPI-based displays so popular, and that people doing DIY Retropie projects might not realize that the interaction with the SPI bus is currently such a large bottleneck until they build the project.

Does anyone know if the above type of functionality would even already exist outside of fbcp? CC @tasanakorn, @notro, @popcornmix, @pelwell, @ladyada, @PaintYourDragon. Or I wonder if this would be a lost cause given the hardware limits of the SPI bus?

Limor "Ladyada" Fried · Answer 1 · Fri Nov 17 2017 01:37:50 GMT+0800 (China Standard Time)

two notes!

most other non-adafruit 3.5" tft boards use an SPI-to-parallel converter that maxes out at a very low frequency and has way-worse performance than the PiTFT 3.5" which does direct SPI (but...its cheaper so there ya go :)
our fork does windowing which greatly greatly improves performance by not writing the whole display each time, just the changed bits. please try it! https://github.com/adafruit/rpi-fbcp

juj · Answer 2 · Fri Nov 17 2017 22:00:25 GMT+0800 (China Standard Time)

Thanks for the reply, much appreciated!

Tweaking around with performance of the code to not update the whole display, I am convinced that this kind of "display stream compression" type of approach is definitely the way to go. Annoyingly that seems to be at odds against using a DMA transfer, since if the display is split up to multiple separate parts to update, the synchronization with the Data/Control GPIO pin seems to require delimiting the communication into distinct DMA transfers instead of being able to do just one. An approach taken by https://github.com/adafruit/rpi-fbcp might be a middle ground here, e.g. optimize away the head and the tail of the display buffer memory area to update in order to retain a single contiguous DMA transferrable unit.

I'll play around with this idea a bit more to see if there's something that can be squeezed.

Limor "Ladyada" Fried · Answer 3 · Sat Nov 18 2017 00:26:05 GMT+0800 (China Standard Time)

~~i vaguely recall that DMA transfers on the pi cannot be longer than a certain length maybe 16 bytes - maybe more info here msperl/spi-bcm2835#13~~ see notro's link below for correct info

let us know if you have any tweaks or updates!

Noralf Trønnes · Answer 4 · Sat Nov 18 2017 00:26:41 GMT+0800 (China Standard Time)

our fork does windowing which greatly greatly improves performance by not writing the whole display each time, just the changed bits. please try it! https://github.com/adafruit/rpi-fbcp

Yeah, that's the way to go. Memory access is fast and SPI transfer is very slow.
fbtft transfers every memory page (4k) that has been touched, so best to only write changes to the framebuffer.
There's some notes here about how fbtft works: https://github.com/notro/fbtft/wiki/FPS

Annoyingly that seems to be at odds against using a DMA transfer, since if the display is split up to multiple separate parts to update, the synchronization with the Data/Control GPIO pin seems to require delimiting the communication into distinct DMA transfers instead of being able to do just one.

99.9% of the SPI transfer time is updating the framebuffer memory on the display controller. D/C is not toggled during pixel data transfer.
fbtft divides this transfer into chunks of 'txbuflen' size before handing it over to the SPI subsystem.
Then if you use spi bus 0 (zero, not 1 or 2), the spi-bcm2835 driver checks the length and decides if it should use DMA (>~100 bytes) to do the transfer. DMA engine overhead is part of the transfer time, for more see: https://github.com/notro/fbtft/wiki#raspberry-pi-spi-dma

Noralf Trønnes · Answer 5 · Sat Nov 18 2017 00:55:09 GMT+0800 (China Standard Time)

MIPI DSI displays is the way forward to get away from the SPI display limitations. But there's no out-of-the-box solutions for the Pi yet, except for the official touchscreen. Which by the way just got an opensource driver: drivers/gpu/drm/panel/panel-raspberrypi-touchscreen.c

DSI panels that have the setup embedded on the controller itself, doesn't need it's own driver, it can get away with just an entry in drivers/gpu/drm/panel/panel-simple.c

I could only find one other DSI display supported by Linux that might work: https://www.raspberrypi.org/forums/viewtopic.php?f=44&t=197889&p=1236102&sid=a1983393d3f496608aef4540ce6e846c#p1235981

juj · Answer 6 · Sun Nov 19 2017 02:20:08 GMT+0800 (China Standard Time)

This weekend I took some time to poke at this, and have had some success. Here's the results:

video
code

I was thinking of trying out DMAs and/or interrupts, but got disheartened by my inability to work with the VideoCore/DispmanX API the way I would have liked, because I could not find a way to get around the suboptimal GPU polling approach. So even if DMA/interrupt would solve one part of the CPU overhead, this GPU polling part would still remain, and all of a sudden it does not feel interesting to try out anymore.

Thanks for the insights on this, much appreciated!

juj · Answer 7 · Mon Nov 20 2017 06:53:52 GMT+0800 (China Standard Time)

but got disheartened by my inability to work with the VideoCore/DispmanX API the way I would have liked, because I could not find a way to get around the suboptimal GPU polling approach

Ended up tackling this with a compromise histogram-based approach that observes passing frame times and takes a 40% percentile of the frame intervals as a heuristic. That seemed to balance between overpolling and minimizing stuttering. I'm now quite happy with the results of how this turned out, and while power consumption is quite high - at around 120% when there's a lot of activity on the screen, in my use case I've got USB power readily available, so I think I'll leave investigating the interrupts approach for a later date.

Paint Your Dragon · Answer 8 · Mon Nov 20 2017 13:35:02 GMT+0800 (China Standard Time)

Thread of possible interest!
raspberrypi/userland#218 (comment)
There's a vsync callback. Tried out AndrewFromMelbourne's code there, it seems to work.

juj · Answer 9 · Mon Nov 20 2017 22:45:18 GMT+0800 (China Standard Time)

I do implement the vsync callback at https://github.com/juj/fbcp-ili9341/blob/master/fbcp-ili9341.cpp#L38. It does work well at providing a stable 16.666.. msec periodic timer, and I find it works out alright on content that does update at fixed 60fps. Unfortunately if the GPU-utilizing application fails to run at 60fps and occassionally dips, the vsync callback does not see this (since after all it's not a callback for provided frames, but for the HDMI output clock), and this causes variable latency in presenting the frames, leading to stuttering. Some applications produce frames at 50Hz or 20Hz or other rates, but the vsync callback still faithfully fires at 60Hz in these cases, creating jarring stuttering if used. Also when running command line terminal, the caret blinks at 5fps, but the vsync callback still hits at 60Hz, so one still has to analyze the contents of the frames to observe if there were any real changes to the contents that occurred. Calling vc_dispmanx_snapshot(..) + vc_dispmanx_resource_read_data(..) takes about 1msec, so that would be 55msecs of wasted work each second (indeed showing up in top as about ~5.5-7.5% CPU usage "for nothing")

That is why I left that feature out as optional compile option - it's probably good if one aims specifically for an application that runs exactly at that 60fps, though for generic power conserving use, it was not helpful. The current code implements a self-synchronizing heuristic that does fix itself for example to 5Hz refreshes when showing a blinking terminal on the SPI display. It probably too causes a 5.5% excess power consumption overhead or more, but it has the benefit of stuttering less when running e.g. Prince of Persia which refreshes at 10Hz, or NES PAL emulator that refreshes at 50Hz.

Opened an inquiry about this at raspberrypi/userland#440, that would be a definite power saving win.

On an unrelated note, I rewrote the span merging algorithm that the code uses this morning, to improve it to merge spans across scanlines, and that improved throughput performance of the driver by about 25%, and results in a higher cutoff point to when the driver needs to revert from progressive to interlaced screen updates.

Limor "Ladyada" Fried · Answer 10 · Tue Nov 21 2017 03:20:11 GMT+0800 (China Standard Time)

@juj regarding the new algorithm, do you want to submit a PR to the adafruit fork of fbcp? we can test/merge it soon

juj · Answer 11 · Tue Nov 21 2017 21:56:07 GMT+0800 (China Standard Time)

The new code is not quite usable as an update to replace fbcp, because fbcp is more generic and works on any fbtft-based SPI display, but this code is hardcoded for BCM2835 and ILI9341 only, so that would regress users with any other chips. Also power consumption characteristics are very different. fbcp seems to use about 10% max, whereas fbcp-ili9341 is observed to need up to 120% of CPU (1.2 consumed cores). I would recommend users look for this code specifically if the above criteria are not an issue.

juj · Answer 12 · Sun May 27 2018 20:44:50 GMT+0800 (China Standard Time)

I'll close this since the issues have been mostly resolved in https://github.com/juj/fbcp-ili9341/ . (A more recent demo video at https://www.youtube.com/watch?v=dqOLIHOjLq4)