johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Benchmarking research: deinterlacing rgb(a) buffers is the slow part

expenses opened this issue · comments

commented

Hi! I just did some benchmarking on what kind of potential performance gains can be achieved when writing to uncompressed scanline files. I did this by writing an alternative function that writes the metadata and header as normal, but manually computes and writes the offset table and lines.

You can see the benchmarks here: master...expenses:exrs:benchmarking. Essentially, the big problem is that scanlines in exr files are stored in a deinterlaced fashion in alphabetical channel order, so you have something like <line header>BBBBGGGGRRRR<line header>BBBBGGGGRRRR. If you have deinterlaced channels as input, you can write files super fast, as you're just doing 3 big writes per line. When the input is interlaced, you have to do deinterlacing, which might take up half of the writing time.
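The per-scanline transform described above can be sketched in plain Rust (a standalone illustration, not the code from the branch; the function name is made up):

```rust
// Deinterlace one RGB scanline: interleaved RGBRGB... pixels become planar
// channels in alphabetical order (B, then G, then R), matching the exr layout.
fn deinterlace_scanline_rgb(interleaved: &[f32], width: usize) -> Vec<f32> {
    assert_eq!(interleaved.len(), width * 3);

    let mut planar = vec![0.0_f32; width * 3];
    let (blue, rest) = planar.split_at_mut(width);
    let (green, red) = rest.split_at_mut(width);

    for (i, pixel) in interleaved.chunks_exact(3).enumerate() {
        red[i] = pixel[0];
        green[i] = pixel[1];
        blue[i] = pixel[2];
    }

    planar
}
```

With deinterlaced input this whole step disappears and each channel is one contiguous copy, which is where the gap between the two benchmarks comes from.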

Here are the benchmark results for me:

running 3 tests
test write_scanlines_deinterlaced            ... bench:  26,801,109 ns/iter (+/- 9,537,414)
test write_scanlines_interlaced              ... bench:  56,451,688 ns/iter (+/- 9,093,729)
test write_scanlines_normal                  ... bench:  93,277,199 ns/iter (+/- 18,158,968)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured

I think it's amazing that you took the time to go and investigate! thanks.

there is a batteries-included storage type that stores the samples of each channel in a separate vector. it uses memcpy to read and write files. look for all_channels in the examples, or any_channels. it should be pretty close to what you coded up, but maybe it's not as optimized as yours

.largest_resolution_level().all_channels().all_layers().all_attributes()

is your goal to make write_validating_to_buffered public? we can do that

also, we are currently working on performing this interleaving on multiple threads

commented

there is a batteries-included storage type that stores the samples of each channel in a separate vector. it uses memcpy to read and write files. look for all_channels in the examples, or any_channels. it should be pretty close to what you coded up, but maybe it's not as optimized as yours
also, we are currently working on performing this interleaving on multiple threads

Ah, great! I didn't know that there was some work that had already happened for this.

is your goal to make write_validating_to_buffered public? we can do that

At the moment I'm just investigating how fast I can make things. This might become a PR at some point.

commented

Hmm, where does interlacing or deinterlacing happen? The reading path for uncompressed scanlines takes this path:

exrs/src/compression/mod.rs

Lines 194 to 206 in a3004cc

pub fn decompress_image_section(self, header: &Header, compressed: ByteVec, pixel_section: IntegerBounds, pedantic: bool) -> Result<ByteVec> {
    let max_tile_size = header.max_block_pixel_size();
    assert!(pixel_section.validate(Some(max_tile_size)).is_ok(), "decompress tile coordinate bug");

    if header.deep { assert!(self.supports_deep_data()) }

    let expected_byte_size = pixel_section.size.area() * header.channels.bytes_per_pixel; // FIXME this needs to account for subsampling anywhere

    // note: always true where self == Uncompressed
    if compressed.len() == expected_byte_size {
        // the compressed data was larger than the raw data, so the small raw data has been written
        Ok(convert_little_endian_to_current(&compressed, &header.channels, pixel_section))
    }
and I don't see a function for this being called anywhere.

commented

I did more benchmarking with AnyChannels! Looks like it's not much slower than my custom implementation.

test write_scanlines_deinterlaced_anychannels ... bench:  27,594,098 ns/iter (+/- 559,496)
test write_scanlines_deinterlaced_custom      ... bench:  23,301,687 ns/iter (+/- 715,609)
test write_scanlines_interlaced_anychannels   ... bench:  45,824,047 ns/iter (+/- 688,379)
test write_scanlines_interlaced_custom        ... bench:  41,103,504 ns/iter (+/- 1,111,189)
test write_scanlines_specificchannels         ... bench:  81,554,311 ns/iter (+/- 1,731,659)

I think something to investigate is why doing a manual deinterlacing step is so much faster than using SpecificChannels. It really shouldn't be.

commented

I changed things such that all data is allocated in the benchmarking closure instead of allowing some of it to be allocated outside, and things look a lot closer!

test write_scanlines_deinterlaced_anychannels ... bench:  47,718,728 ns/iter (+/- 3,089,806)
test write_scanlines_deinterlaced_custom      ... bench:  43,331,516 ns/iter (+/- 1,696,782)
test write_scanlines_interlaced_anychannels   ... bench:  80,050,009 ns/iter (+/- 2,803,357)
test write_scanlines_interlaced_custom        ... bench:  75,102,131 ns/iter (+/- 1,026,776)
test write_scanlines_specificchannels         ... bench:  80,844,714 ns/iter (+/- 1,955,587)

I think something to investigate is why doing a manual deinterlacing step is so much faster than using SpecificChannels. It really shouldn't be.

It would be great if we could optimize the existing code, awesome :)

Hmm, where does interlacing or deinterlacing happen?

See #178 in the comments. It's part of the SpecificChannels, as you already seem to have found out :)

this part is important:

The final pixel tuples are stored in a temporary vector for that scanline, which contains placeholder values at first. Then for each channel, we step through the source samples in the slice, and for each sample we overwrite that one component of the corresponding pixel in the temporary vector.
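That per-channel overwrite pattern can be sketched in plain Rust (hypothetical names, not the actual exrs internals):

```rust
// Interleave one planar scanline back into pixel tuples: start with
// placeholder pixels, then let each channel pass overwrite one component
// of every pixel. Channels are stored alphabetically: B, then G, then R.
fn interleave_scanline_rgb(planar: &[f32], width: usize) -> Vec<(f32, f32, f32)> {
    assert_eq!(planar.len(), width * 3);

    let mut pixels = vec![(0.0_f32, 0.0_f32, 0.0_f32); width]; // placeholders

    let (blue, rest) = planar.split_at(width);
    let (green, red) = rest.split_at(width);

    // one pass per channel, each writing a single component of every tuple
    for (pixel, &sample) in pixels.iter_mut().zip(blue)  { pixel.2 = sample; }
    for (pixel, &sample) in pixels.iter_mut().zip(green) { pixel.1 = sample; }
    for (pixel, &sample) in pixels.iter_mut().zip(red)   { pixel.0 = sample; }

    pixels
}
```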

it's roughly in this part of the code:

// match outside the loop to avoid matching on every single sample
match self.channel.sample_type {
    SampleType::F16 => for pixel in pixels.iter_mut() {
        *get_pixel(pixel) = Sample::from_f16(f16::read(&mut own_bytes_reader).expect(error_msg));
    },

    SampleType::F32 => for pixel in pixels.iter_mut() {
        *get_pixel(pixel) = Sample::from_f32(f32::read(&mut own_bytes_reader).expect(error_msg));
    },

    SampleType::U32 => for pixel in pixels.iter_mut() {
        *get_pixel(pixel) = Sample::from_u32(u32::read(&mut own_bytes_reader).expect(error_msg));
    },

commented

Ah, okay cool :)

Honestly I think the conclusion that I've drawn from doing these benchmarks is that this crate is doing things correctly when it comes to performance :) There are a few differences between the existing code and the function I've written, but they mostly come down to my code being a bit more straightforward.

One idea I did test: if the input rgb(a) buffer is mutable, either a &mut [f32] or a Vec<f32>, you can deinterlace it in place using a function like this:

fn deinterlace_rgba_rows_inplace(rgba: &mut [f32], width: usize, height: usize) {
    // scratch space for one deinterlaced row (note: alpha is discarded)
    let mut row = vec![0.0; width * 3];

    for y in 0 .. height {
        {
            let input = &rgba[y * width * 4 .. (y + 1) * width * 4];
            let (blue, remaining) = row.split_at_mut(width);
            let (green, red) = remaining.split_at_mut(width);

            for (((chunk, red), green), blue) in input.chunks_exact(4).zip(red).zip(green).zip(blue) {
                *red = chunk[0];
                *green = chunk[1];
                *blue = chunk[2];
            }
        }

        rgba[y * width * 3 .. (y + 1) * width * 3].copy_from_slice(&row);
    }
}

This only allocates a row's worth of pixels, which is quicker. Additionally, the data for each row can be written out with a single memory copy (apart from the row headers).

This has good benchmark performance:

test write_scanlines_deinterlaced_anychannels ... bench:  74,560,475 ns/iter (+/- 18,268,405)
test write_scanlines_deinterlaced_custom      ... bench:  64,145,105 ns/iter (+/- 20,254,319)
test write_scanlines_interlaced_anychannels   ... bench: 111,880,132 ns/iter (+/- 23,037,003)
test write_scanlines_interlaced_custom        ... bench: 107,324,443 ns/iter (+/- 11,840,847)
test write_scanlines_interlaced_custom_bgr    ... bench:  77,506,742 ns/iter (+/- 3,902,751)
test write_scanlines_specificchannels         ... bench:  86,368,055 ns/iter (+/- 3,157,508)

This requirement that the input be mutable does make things harder, though - if you cloned a &[f32] into a Vec<f32> before calling deinterlace_rgba_rows_inplace, I'm pretty sure that would cancel out the gains.

thanks! appreciate it. we should not stop profiling though. currently the efforts are focused on doing all the work on multiple threads, including interlacing, which will help :)

unfortunately, deinterlacing in place also requires all channels to have the same sample type. it might be worth optimizing for that case, since it is the most common one, but in general the library must support arbitrary channels. also, the library does not assume any particular pixel storage - it will not assume that the user stores their pixels in a vector in the first place. that's why the API is built around a callback of the form |x,y| (r, g, b), which works with individual pixels
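The idea behind that callback shape can be sketched like this (a hypothetical standalone function, not the actual exrs API):

```rust
// A storage-agnostic writer pulls pixels through a closure, so it never
// assumes how the caller stores them (interleaved vec, planar vecs, computed
// on the fly, ...). Here it collects planar scanlines in alphabetical
// channel order (B, G, R), matching the file layout.
fn collect_scanlines_rgb(
    width: usize,
    height: usize,
    get_pixel: impl Fn(usize, usize) -> (f32, f32, f32),
) -> Vec<f32> {
    let mut out = Vec::with_capacity(width * height * 3);

    for y in 0 .. height {
        // each scanline is planar: all blue samples, then green, then red
        for x in 0 .. width { out.push(get_pixel(x, y).2); } // blue
        for x in 0 .. width { out.push(get_pixel(x, y).1); } // green
        for x in 0 .. width { out.push(get_pixel(x, y).0); } // red
    }

    out
}
```

Calling the closure three times per pixel is wasteful here; it just illustrates how the callback decouples the writer from the caller's pixel storage.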

Note that interlacing and deinterlacing have been optimized a lot in #173, which is not yet part of any stable release. Make sure you're profiling the latest version from git, so that you get the optimized implementations.

Since no action appears to be needed right now, should this be closed or moved to a discussion?

Thanks for taking the time to tinker, benchmark, and share your results, @expenses :) I'll enable discussions as a place to collect detailed insights