tolga9009 / elgato-gchd

DISCONTINUED. Reverse engineering the Elgato Game Capture HD to make it work under Linux.

Performance issues - FIFO / UDP streaming

tolga9009 opened this issue

Note: I've already posted this on Gitter, but I'm adding it here so we can keep track of this issue.

Sometimes artefacts pop up in the video stream (both FIFO & UDP). This is due to lost buffers, caused by a demanding / slow buffering mechanism.

I've profiled the application using Callgrind (QtCreator 4.0):
callgrind.zip

The problem seems to be std::vector, which dynamically allocates 16384 (DATA_BUF) elements on every loop iteration. std::queue<std::vector> also seems to have performance issues on several operations.

Possible fix: use a fixed-size std::array instead. We might also need to think about a new buffering mechanism.
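For illustration, here's a minimal sketch of that change: one fixed-size std::array that is reused across loop iterations instead of a freshly allocated std::vector. The readFromDevice helper and the loop structure are placeholders, not the project's actual code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// DATA_BUF matches the transfer size discussed above (0x4000 bytes).
constexpr std::size_t DATA_BUF = 16384;

// Hypothetical stand-in for the real USB bulk read.
static std::size_t readFromDevice(std::uint8_t *dst, std::size_t len) {
    // ... fill dst from the capture device ...
    (void)dst;
    return len;
}

int main() {
    // One stack buffer, reused on every iteration, instead of constructing a
    // new std::vector<uint8_t>(DATA_BUF) (a heap allocation) per loop pass.
    std::array<std::uint8_t, DATA_BUF> buffer;

    for (int i = 0; i < 100; ++i) {
        std::size_t received = readFromDevice(buffer.data(), buffer.size());
        // ... forward the first `received` bytes to the FIFO / socket writer ...
        (void)received;
    }
}
```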

Cheers,
Tolga

Okay, I've worked on this a bit. std::array offers much better performance in our use case. Still not perfect for UDP multicasting, but much better than before. The artefacts are completely gone for FIFO; UDP unicasting is also near perfect.

Gonna push out the changes soon. I'm still not satisfied with performance, so I'm leaving this issue open until it's perfect.

I was able to reproduce this error on my Mac machine, featuring an Intel Xeon E3-1225v3. So this is not due to raw performance, but due to an inefficient writer / reader thread algorithm. I also saw performance issues on a system featuring the Intel Core i3-4005U CPU. This definitely needs further optimization.

The writer thread is much faster than the reader thread. While the writer thread is receiving a series of data arrays from the device, it doesn't let the reader get in between and read from the queue. This causes some queue items to get lost, leading to glitches / artefacts.

Maybe we shouldn't use std::queue at all and keep things simple and fast: just write to and read from a single array, protect it with a std::mutex, and sync the threads with a std::condition_variable and a bool newDataAvailable.
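A minimal sketch of that scheme, assuming a single shared buffer guarded by a std::mutex, with a std::condition_variable and a bool newDataAvailable flag. The thread functions, the dummy data source and the timing values are illustrative assumptions, not the project's actual implementation.

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>

constexpr std::size_t DATA_BUF = 16384;

std::array<std::uint8_t, DATA_BUF> sharedBuffer;
std::mutex bufferMutex;
std::condition_variable bufferCv;
bool newDataAvailable = false;            // guarded by bufferMutex
std::atomic<bool> stopStreaming{false};

// Writer: fills the shared buffer (dummy data here) and flags the reader.
void writerThread() {
    while (!stopStreaming) {
        std::array<std::uint8_t, DATA_BUF> local{};
        // ... in the real code: read DATA_BUF bytes from the device into `local` ...
        {
            std::lock_guard<std::mutex> lock(bufferMutex);
            sharedBuffer = local;
            newDataAvailable = true;
        }
        bufferCv.notify_one();
        std::this_thread::sleep_for(std::chrono::microseconds(700));
    }
}

// Reader: waits until new data is flagged, then consumes the buffer.
void readerThread() {
    while (!stopStreaming) {
        std::unique_lock<std::mutex> lock(bufferMutex);
        bufferCv.wait_for(lock, std::chrono::milliseconds(10),
                          [] { return newDataAvailable; });
        if (!newDataAvailable)
            continue;
        newDataAvailable = false;
        // ... write sharedBuffer to the FIFO / UDP socket ...
    }
}

int main() {
    std::thread writer(writerThread);
    std::thread reader(readerThread);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stopStreaming = true;
    writer.join();
    reader.join();
}
```

The trade-off of a single buffer: when the reader stalls, the writer simply overwrites data that hasn't been consumed yet, so a peak costs whole buffers instead of mixing partial ones.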

As thread programming is not the easiest thing in life, I'm thankful for any suggestions (like always).

I've come up with some ideas on this issue.

After installing the latest official Elgato drivers on my Windows machine and investigating the USB traffic, I noticed that they've switched to a larger buffer size. Instead of the former 16384 bytes (0x4000), they now use 61440 bytes (0xf000). A quick test showed me that getting 16384 bytes from the device takes around 700 µs, while getting 61440 bytes takes around 2000 µs. This means fewer locks and fewer loop iterations, thus being more efficient. The downside: a larger buffer means more lag.
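For context, here's what the larger transfer size looks like as a plain libusb bulk read. Only the two buffer sizes come from the USB capture; the endpoint address and timeout below are assumptions for illustration.

```cpp
#include <libusb-1.0/libusb.h>

// Old transfer size (0x4000) vs. the size the current official drivers use
// (0xf000). Larger transfers mean fewer loop iterations and fewer locks, at
// the cost of more latency.
constexpr int DATA_BUF_OLD = 0x4000;   // 16384 bytes, ~700 µs per transfer
constexpr int DATA_BUF_NEW = 0xf000;   // 61440 bytes, ~2000 µs per transfer

// Bulk read with the larger buffer. The endpoint address (0x81) and the
// timeout are illustrative assumptions, not values taken from the project.
int readBulk(libusb_device_handle *handle, unsigned char *buffer) {
    int transferred = 0;
    int ret = libusb_bulk_transfer(handle, 0x81, buffer, DATA_BUF_NEW,
                                   &transferred, 5000);
    return (ret == 0) ? transferred : ret;
}
```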

Also, instead of locking, reading and popping single elements from the queue, I came up with the idea of handing the entire queue over to the reading thread on wakeup, by swapping the current queue with an empty one (using std::queue::swap). This way, the reading thread can process the whole backlog without requesting a lock for every single queue item.
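Here's a minimal sketch of that swap idea, assuming the queue holds fixed-size buffers; names such as drainQueue and sharedQueue are illustrative, not the project's actual code.

```cpp
#include <array>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>

constexpr std::size_t DATA_BUF = 16384;
using Buffer = std::array<std::uint8_t, DATA_BUF>;

std::queue<Buffer> sharedQueue;
std::mutex queueMutex;
std::condition_variable queueCv;

// On wakeup, take the whole backlog in one locked swap, then process every
// buffer without touching the lock again.
void drainQueue() {
    std::queue<Buffer> local;
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !sharedQueue.empty(); });
        sharedQueue.swap(local);   // O(1): hands the entire queue to the reader
    }
    while (!local.empty()) {
        // ... write local.front() to the FIFO / socket ...
        local.pop();
    }
}

int main() {
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        sharedQueue.push(Buffer{});
        sharedQueue.push(Buffer{});
    }
    queueCv.notify_one();
    drainQueue();
}
```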

Another idea is to implement something similar to the old behaviour, without using threads at all. But instead of giving write() (to the FIFO / disk / socket) an unlimited amount of time to block (which makes the device unresponsive when it blocks for too long), set a maximum. Writing 61440 bytes to the FIFO takes around 70 µs, with some peaks whenever the reader slows down. So we could be fine without threads at all.
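One way to cap the blocking time would be to poll() the file descriptor for writability with a timeout before calling write(). The sketch below assumes a "skip on timeout" policy and an arbitrary timeout value; neither is taken from the project.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cstddef>

// Write to the FIFO, but give up if it doesn't become writable within
// `timeoutMs` instead of blocking indefinitely and stalling the device.
// Returns bytes written, 0 on timeout, -1 on error.
ssize_t boundedWrite(int fd, const unsigned char *data, std::size_t len,
                     int timeoutMs) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;

    int ready = poll(&pfd, 1, timeoutMs);
    if (ready <= 0) {
        return ready;              // 0 = timed out, -1 = poll error
    }
    return write(fd, data, len);
}
```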

Update on this: I'm now sticking with the old behaviour. However, the peaks need to be taken care of: sometimes the reader slows down while new data is arriving. We need some kind of temporary bounded queue, so we don't get corrupted output / artefacts.
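A rough sketch of such a temporary bounded queue: it absorbs short reader stalls and drops the oldest buffer on overflow, so a peak costs a skipped buffer rather than corrupted output. The capacity and the drop-oldest policy are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <deque>

constexpr std::size_t DATA_BUF = 16384;
using Buffer = std::array<std::uint8_t, DATA_BUF>;

// Small bounded queue that buffers data while the reader is slow. On
// overflow the oldest buffer is dropped instead of blocking the device.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(const Buffer &buf) {
        if (queue_.size() >= capacity_) {
            queue_.pop_front();        // drop the oldest buffer on overflow
        }
        queue_.push_back(buf);
    }

    bool pop(Buffer &out) {
        if (queue_.empty()) {
            return false;
        }
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<Buffer> queue_;
};
```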

It's going to be a bit complicated, but I'll eventually get it sorted out. Once this bug is fixed, I will update to version 0.2.0.

Commit 1b26b5d should fix FIFOs.

I've completely redesigned FIFO handling. My personal tests were successful, but I need broader testing. Please check out the master branch, test it and post your feedback here. Also, please post your system details (CPU, RAM, GPU, OS / distro). I will tag 0.2.0 as soon as a few people have tried it out successfully.

Cheers,
Tolga

Commit 1b26b5d totally breaks Mac OS X support. Guess it's not going to be fixed this way...

Reimplementing the old behaviour without polls will leave the device in an unresponsive state as soon as the FIFO gets paused. Closing the FIFO is no problem, though. The official drivers are able to recover from such an unresponsive state, so we can reverse-engineer that in the future.

On the other hand, trying to work around #13 created this issue, which is far worse than the original one. So the best thing to do is to revert the changes and reopen issue #13.

Update: ee37379 reimplemented a blocking FIFO.