johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Perform pixel format conversion in worker threads

Shnatsel opened this issue

Right now pixel format conversion is performed on the main thread, single-threaded.

Combined with #178 this slows down decoding massively on multi-core systems: the worker threads complete their work quickly, and the entire system ends up just waiting for the main thread to chug through pixel format conversion.

Pixel format conversion is embarrassingly parallel, and should be performed in the worker threads instead.
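
For illustration, here is a minimal sketch of that idea, using rayon and hypothetical `decompress`/`convert_samples` stand-ins rather than the actual exrs internals: each block is decompressed and converted inside the worker threads, and the main thread only collects the finished results.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the real exrs decompression and sample conversion.
fn decompress(raw: Vec<u8>) -> Vec<u8> {
    raw // placeholder: the real code would decode RLE/ZIP/PIZ here
}

fn convert_samples(decompressed: Vec<u8>) -> Vec<f32> {
    // placeholder: the real code would convert e.g. f16 or u32 samples
    decompressed.into_iter().map(|byte| byte as f32).collect()
}

// Each block is decompressed *and* converted inside the worker threads,
// so the main thread only assembles the finished blocks.
fn decode_blocks(raw_blocks: Vec<Vec<u8>>) -> Vec<Vec<f32>> {
    raw_blocks
        .into_par_iter()
        .map(|raw| convert_samples(decompress(raw)))
        .collect()
}
```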

currently working on it, see compare/parallel-sample-conversion

got it working. required more code than expected, for multiple reasons. see again at compare/parallel-sample-conversion.

The benchmarks don't look very promising though:

master
test read_image_rgba_f32_to_f16 ... bench:  38,371,430 ns/iter (+/- 1,072,023)
test read_image_rgba_f32_to_f32 ... bench:  19,180,770 ns/iter (+/- 978,339)
test read_image_rgba_f32_to_u32 ... bench:  23,938,060 ns/iter (+/- 1,805,984)

parallel conversion
test read_image_rgba_f32_to_f16 ... bench:  35,137,080 ns/iter (+/- 2,895,664)
test read_image_rgba_f32_to_f32 ... bench:  23,335,060 ns/iter (+/- 3,405,254)
test read_image_rgba_f32_to_u32 ... bench:  27,597,810 ns/iter (+/- 2,542,094)

and to show that these comparisons come with a large margin of error, here are the remaining benchmarks, which do not do any conversion and should in theory stay the same:

master
test read_single_image_rle_all_channels               ... bench:  29,500,960 ns/iter (+/- 3,776,853)
test read_single_image_rle_non_parallel_all_channels  ... bench:  37,015,770 ns/iter (+/- 2,773,785)

parallel conversion
test read_single_image_rle_all_channels               ... bench:  22,792,280 ns/iter (+/- 2,726,453)
test read_single_image_rle_non_parallel_all_channels  ... bench:  36,614,260 ns/iter (+/- 2,112,450)

Interesting. Have you tried profiling it to see where the time is actually spent? I've described profiling in this article.

this is my next step :)

however, I suspect that conversion shouldn't be the bottleneck. Yes, it is done on the main thread, but the decompressed blocks arrive on the main thread concurrently, and each block is converted directly, without waiting for the other blocks. this means the main thread converts the samples of one block while the other threads are already decompressing the next blocks. so all threads should be busy most of the time.
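
to illustrate, here's a rough sketch of that pipelining, with hypothetical `decompress`/`convert_samples` helpers instead of the actual exrs code: the main thread converts each block as soon as it arrives, while the worker threads are still decompressing the remaining blocks.

```rust
use std::sync::mpsc;
use std::thread;

// hypothetical helpers; the real decompression and conversion live in exrs
fn decompress(raw: Vec<u8>) -> Vec<u8> { raw }
fn convert_samples(decompressed: &[u8]) -> Vec<f32> {
    decompressed.iter().map(|&byte| byte as f32).collect()
}

fn decode_pipelined(raw_blocks: Vec<Vec<u8>>) -> Vec<Vec<f32>> {
    let (sender, receiver) = mpsc::channel();

    // worker threads (one per block here, only for illustration):
    // decompress and send the result to the main thread
    let workers: Vec<_> = raw_blocks.into_iter().map(|raw| {
        let sender = sender.clone();
        thread::spawn(move || sender.send(decompress(raw)).unwrap())
    }).collect();
    drop(sender);

    // main thread: convert each block as soon as it arrives,
    // while the remaining blocks are still being decompressed
    let converted: Vec<Vec<f32>> = receiver.iter()
        .map(|block| convert_samples(&block))
        .collect();

    for worker in workers { worker.join().unwrap(); }
    converted
}
```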

but there is one exception: when there is no compression, only the main thread is busy. wait - just realized that the current implementation in this branch does not use multithreading when the file is uncompressed. I'll need to change that condition before benchmarking again

here are the new numbers:

master @ 66a247789004d974043300daf7bfc57c7d52a459
test read_image_rgba_f32_to_f16 ... bench:  38,287,210 ns/iter (+/- 1,762,883)
test read_image_rgba_f32_to_f32 ... bench:  19,387,110 ns/iter (+/- 1,232,147)
test read_image_rgba_f32_to_u32 ... bench:  23,759,800 ns/iter (+/- 1,588,751)

test read_single_image_rle_all_channels               ... bench:  23,677,890 ns/iter (+/- 3,353,774)
test read_single_image_rle_non_parallel_all_channels  ... bench:  36,565,290 ns/iter (+/- 1,630,368)

parallel
test read_image_rgba_f32_to_f16 ... bench:  35,322,980 ns/iter (+/- 2,195,490)
test read_image_rgba_f32_to_f32 ... bench:  22,905,140 ns/iter (+/- 1,712,346)
test read_image_rgba_f32_to_u32 ... bench:  26,960,230 ns/iter (+/- 2,984,200)

test read_single_image_rle_all_channels               ... bench:  22,866,190 ns/iter (+/- 3,848,219)
test read_single_image_rle_non_parallel_all_channels  ... bench:  36,992,430 ns/iter (+/- 2,598,853)

Strange - I see that the case with no conversion has regressed? Is there additional task dispatch overhead for parallel conversion? If so, I suppose format conversion is not worth parallelizing unless we're doing end-to-end parallel decoding without the dispatch overhead. Conversion intrinsics combined with batched conversion should improve it a lot.
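
For reference, here is a rough sketch of what batched conversion could look like, using the `half` crate's `f16::from_f32` in one tight loop over a reused buffer (whether this maps directly onto the exrs internals is an open question):

```rust
use half::f16;

// A rough sketch of "batched" conversion: run a whole block's f32 -> f16
// conversion in one tight loop into a reused output buffer, instead of
// converting single samples at scattered call sites. This gives the compiler
// (or explicit F16C intrinsics) a chance to vectorize the loop.
fn convert_block_f32_to_f16(samples: &[f32], out: &mut Vec<f16>) {
    out.clear();
    out.reserve(samples.len());
    out.extend(samples.iter().map(|&sample| f16::from_f32(sample)));
}
```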

yes, probably some optimization that the compiler couldn't do anymore due to the architectural changes. in theory, it's all static and inlinable.

there is some object cloning going on in this branch, but I can't imagine it making such a huge difference. profiling should show where that overhead comes from

actually, the whole block needs to be allocated one extra time compared to before - because we need something to hold the converted values in order to send them between the threads. previously, conversion was done on the fly, without collecting all the converted values first. this might be part of the problem

Ah, yes, that's probably the source of the issue. So we should either do fully parallel processing, so that these intermediate parts don't have to be sent between threads, or just run them on a single thread.

no, there are no intermediate parts sent across threads. the blocks are decompressed on each thread, and then they are deinterlaced on the same thread. it's just that we need some vector to hold the deinterlaced data, so each thread allocates more than before. but the amount of sending is the same as before.
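
as a purely hypothetical sketch of that difference (using sample conversion as the example, not the actual exrs code):

```rust
use half::f16;

// before: samples are converted on the fly while being written into the
// final image storage, so no extra per-block buffer is needed
fn insert_block_on_the_fly(image: &mut Vec<f16>, block_samples: &[f32]) {
    image.extend(block_samples.iter().map(|&s| f16::from_f32(s)));
}

// now: each worker thread first collects the converted samples of its block
// into a fresh Vec (one extra allocation per block); that Vec is what later
// gets written into the image, but the amount of sending stays the same
fn convert_block_on_worker(block_samples: &[f32]) -> Vec<f16> {
    block_samples.iter().map(|&s| f16::from_f32(s)).collect()
}
```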

I'll do some basic profiling nonetheless, just to avoid having really trivial performance problems in the new implementation