johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Pixel format conversions are slow

Shnatsel opened this issue · comments

Right now format conversion invokes a callback on every pixel access, and is generally oriented towards per-pixel processing:

impl<Sample: FromNativeSample> SampleReader<Sample> {
    fn read_own_samples<'s, FullPixel>(
        &self, bytes: &'s [u8], pixels: &mut [FullPixel],
        get_pixel: impl Fn(&mut FullPixel) -> &mut Sample
    ) {
        let start_index = pixels.len() * self.channel_byte_offset;
        let byte_count = pixels.len() * self.channel.sample_type.bytes_per_sample();

        let mut own_bytes_reader = &bytes[start_index .. start_index + byte_count]; // TODO check block size somewhere
        let error_msg = "error when reading from in-memory slice";

        // match outside the loop to avoid matching on every single sample
        match self.channel.sample_type {
            SampleType::F16 => for pixel in pixels.iter_mut() {
                *get_pixel(pixel) = Sample::from_f16(f16::read(&mut own_bytes_reader).expect(error_msg));
            },

            SampleType::F32 => for pixel in pixels.iter_mut() {
                *get_pixel(pixel) = Sample::from_f32(f32::read(&mut own_bytes_reader).expect(error_msg));
            },

            SampleType::U32 => for pixel in pixels.iter_mut() {
                *get_pixel(pixel) = Sample::from_u32(u32::read(&mut own_bytes_reader).expect(error_msg));
            },
        }

        debug_assert!(own_bytes_reader.is_empty(), "bytes left after reading all samples");
    }
}

Note the get_pixel: impl Fn(&mut FullPixel) -> &mut Sample in the signature.

This prevents the compiler from vectorizing the code. There are vector instructions to turn batches of values of one type into another, but this interface prevents their use.

Lack of vectorization is especially bad for f16, which has no native scalar support on the CPU; the only hardware support for it is vector instructions that convert to/from f16. See #177 for details - it's closed in favor of this one, but is still a problem.

For f16 conversions not to be atrociously slow, they need to be batched together explicitly. The half crate provides functions converting slices to/from f16, see e.g. convert_from_f32_slice.
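
For illustration, here is a minimal sketch of what a batched scanline conversion could look like using the half crate's slice extension trait; the function name and buffer handling are made up for the example:

// Sketch of batched f16 -> f32 conversion via the half crate's slice API.
// The function name and buffer handling are illustrative, not library code.
use half::f16;
use half::slice::HalfFloatSliceExt;

fn f16_samples_to_f32(samples: &[f16]) -> Vec<f32> {
    let mut output = vec![0.0f32; samples.len()];
    // Converts the whole slice in one call, letting the crate use hardware
    // f16 conversion instructions where available instead of per-sample calls.
    samples.convert_to_f32_slice(&mut output);
    output
}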

For other types, straightforward conversion loops such as these will be automatically vectorized:

input.iter().zip(output.iter_mut()).for_each(|(&in_val, out_val)| *out_val = in_val.into());

provided that input and output are both slices. You can also convert into a Vec:

let output: Vec<_> = input.iter().map(|&in_val| in_val.into()).collect();

The conversion code is already pretty complex as it is. That's not ideal, of course, so let me explain why it ended up that way:

The user of the library can specify the type of pixels they expect by declaring a tuple type, for example (f16, f32, f32, u32, f16, f16). The library will automatically find the channels with the specified names in the file, and extract only the desired channels from the image file. This is done by constructing a tuple of the requested form for each pixel and then looking up the values from each channel, in a linear fashion for cache optimization. Some values may be filled in if the channel was declared optional and is not present in the file. All of that is type safe, and as much as possible is done at compile time.

To achieve this, many intermediate reader objects must be constructed. They are built using recursive types, to avoid having to implement each trait multiple times for all the tuple variations. The recursive types look like this: Recursive<T, AnotherRecursive>. For example, the tuple type (f16, f32, u32) becomes, on a very abstract level, something like (f16, (f32, (u32, ()))). In reality, it ends up as something like Recursive<Read<f16>, Recursive<Read<f32>, Recursive<Read<u32>, NoneMore>>>. In the library, we then only need to implement each trait once for the generic recursive types, for example impl X for Recursive<OwnContent, AnotherRecursiveType>.
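
To make the pattern concrete, here is a small self-contained sketch of how one trait implementation can cover any nesting depth; the names below are invented for the example and are not the library's actual internal types:

// Conceptual sketch of the recursive-type pattern; all names are invented
// for illustration and do not match the library's internals.
use half::f16;

struct NoneMore;
struct Recursive<Value, Inner>(Value, Inner);

trait CountChannels {
    fn count() -> usize;
}

impl CountChannels for NoneMore {
    fn count() -> usize { 0 }
}

// One implementation covers every nesting depth, so the trait never has to
// be written out separately for 1-, 2-, 3-, ... element tuples.
impl<Value, Inner: CountChannels> CountChannels for Recursive<Value, Inner> {
    fn count() -> usize { 1 + Inner::count() }
}

// (f16, f32, u32) corresponds to this nesting:
type ThreeChannels = Recursive<f16, Recursive<f32, Recursive<u32, NoneMore>>>;

fn main() {
    assert_eq!(<ThreeChannels as CountChannels>::count(), 3);
}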

It was important to me to provide the most ergonomic API, so in the end, it will always be a pixel-by-pixel callback for the user. For that reason, most of the code in the library also had to be in a pixel-by-pixel structure. However, it might be possible to change some of the internal structure to work more in batches.

That's why it's pretty complex, unfortunately. I hope this clarifies how the recursive types work; I was not able to find a cleaner solution.

What I want to say is that this issue will require some very tricky type management. We should make sure that there are no easier alternatives before we invest the time. For example, we should definitely add some documentation that clearly recommends using f32 where possible.

We could try to allocate an intermediate buffer just for conversion purposes, if that is more performant. This way we might get faster conversion without having to change any of the recursive type structures (which would be rather annoying).

That does sound like a good idea. Those buffers could even be allocated on the stack - we don't need them to be big.

E.g. for f16, an [f16; 4] buffer on the stack would suffice to get rid of the atrocious f16 performance, although bigger buffers will see bigger gains from vectorization.

We can then iterate over the buffer and run the user-supplied callback on each element.
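
As a rough sketch of that batching idea (the names, the chunk size, and the callback shape are placeholders, not the library's API):

// Rough sketch of stack-buffer batching: convert a small chunk of raw f16
// samples at once, then run the existing per-sample callback over the
// converted chunk. All names and sizes here are placeholders.
use half::f16;
use half::slice::HalfFloatSliceExt;

const BATCH: usize = 16; // a small power of two, cheap to keep on the stack

fn for_each_sample_as_f32(samples: &[f16], mut per_sample: impl FnMut(f32)) {
    let mut buffer = [0.0f32; BATCH];

    for chunk in samples.chunks(BATCH) {
        let converted = &mut buffer[..chunk.len()];
        // One vectorizable conversion per chunk instead of one call per sample.
        chunk.convert_to_f32_slice(converted);

        for &value in converted.iter() {
            per_sample(value);
        }
    }
}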

This is the data layout during the reading process:

In the file, the samples are stored line by line. Within each scanline, however, the values are not stored pixel by pixel, but channel by channel. This means that for each channel, all of its samples for that line are stored in a contiguous slice. In theory, this allows batch conversion of all the samples of a channel; I think it should even be possible in the code snippet you included. The final pixel tuples are stored in a temporary vector for that scanline, which contains placeholder values at first. Then for each channel, we step through the source samples in the slice, and for each sample we overwrite that one component of the corresponding pixel in the temporary vector. This is the snippet you included. Those final tuples in the temporary vector are handed to the library's user one by one.

Edit: fixed some wrong information in the explanation above
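
To illustrate that layout, here is a hedged sketch of converting one channel's contiguous scanline samples in a batch and then scattering them into the interleaved pixel tuples; the pixel struct and names are invented for the example:

// Illustrative sketch of the layout described above: each channel's samples
// are contiguous within a scanline, so they can be batch-converted and then
// scattered into the interleaved pixel tuples. The pixel type is invented.
use half::f16;
use half::slice::HalfFloatSliceExt;

#[derive(Clone, Copy, Default)]
struct Rgb { r: f32, g: f32, b: f32 }

fn fill_red_channel(red_samples: &[f16], pixels: &mut [Rgb]) {
    assert_eq!(red_samples.len(), pixels.len());

    // Batch-convert the whole contiguous channel slice first ...
    let mut red_f32 = vec![0.0f32; red_samples.len()];
    red_samples.convert_to_f32_slice(&mut red_f32);

    // ... then overwrite just this one component of every pixel tuple.
    for (pixel, &red) in pixels.iter_mut().zip(red_f32.iter()) {
        pixel.r = red;
    }
}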

E.g. for f16, an [f16; 4] buffer on the stack would suffice to get rid of the atrocious f16 performance, although bigger buffers will see bigger gains from vectorization.

that sounds a lot like that SIMD library that offers an iterator-based API, called faster

don't know if they do conversion though

That does sound like a good idea. Those buffers could even be allocated on the stack - we don't need them to be big.

yes! great idea :)

In

SampleType::F32 => for pixel in pixels.iter_mut() {
    *get_pixel(pixel) = Sample::from_f32(f32::read(&mut own_bytes_reader).expect(error_msg));
},

I suggest we add a function Sample::from_f32x12(slice), or a similar function, as an analogue to Sample::from_f32. Of course we'll do it for all the data types.
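
For example, a hypothetical batched counterpart could look roughly like this; the trait and method names are made up here and are not what the library will necessarily ship:

// Hypothetical sketch of a batched counterpart to Sample::from_f32.
// Trait and method names are invented for illustration only.
trait FromF32Batch: Sized {
    fn from_f32(value: f32) -> Self;

    // Default implementation falls back to the scalar conversion; individual
    // sample types could override this with an explicitly vectorized version.
    fn from_f32_slice(source: &[f32], destination: &mut [Self]) {
        for (dst, &src) in destination.iter_mut().zip(source.iter()) {
            *dst = Self::from_f32(src);
        }
    }
}

impl FromF32Batch for f32 {
    fn from_f32(value: f32) -> Self { value }
}

impl FromF32Batch for u32 {
    fn from_f32(value: f32) -> Self { value as u32 }
}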

In the snippet, get_pixel should actually be called get_sample.

Sounds good to me! Except the vector size should be a power of two.

yes yes. Then we can just use a chunks iterator and that's it - pretty simple after all, all the worries I had were not necessary haha

Great! I'd prefer if you implemented it, since you're familiar with the code, and the difficulty is more in the API than in the algorithm. I'm happy to advise on the vectorization though!

I'll prepare a prototype :)

Once all the outstanding PRs are merged and this issue is squared away, I'd expect exrs to be roughly on par with the canonical OpenEXR implementation in terms of performance in multi-threaded mode (provided that intrinsics are used for f16 conversion).

We might even beat the canonical OpenEXR on ARM, because the C implementation only includes SSE-optimized functions and the scalar fallbacks are really naive, while exrs leans on the auto-vectorizer, which can target any platform.

that would be great! I hope I'll have time on the weekend