johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Converting from `f32` to `f16` incurs a 2x slowdown on reading images

Shnatsel opened this issue

Profiling shows that on the `read_single_image_from_buffer_rgba_channels` benchmark, 50% of the time is spent in the `half::binary16::f16::from_f32` function.

Interactive profile so you can explore it yourself: https://share.firefox.dev/3Vt2pD6

Specifically, the profiler blames these lines:

```rust
SampleType::F32 => for pixel in pixels.iter_mut() {
    *get_pixel(pixel) = Sample::from_f32(f32::read(&mut own_bytes_reader).expect(error_msg));
},
```

Is the benchmark actually about loading an f32 image as f16? I expected it to measure a straightforward load.

I've sent a PR to half to allow inlining the conversion functions - it seems the #[inline] attribute was missed on them: starkat99/half-rs#61
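For illustration only (this is not the actual diff from that PR), the kind of change in question is just marking the small, hot conversion routine so it can be inlined across crate boundaries even without LTO. A minimal sketch, with a hypothetical wrapper standing in for half's internal code:

```rust
use half::f16;

// Hypothetical wrapper standing in for half's internal conversion routine:
// without #[inline], a non-generic function like this is not inlined into
// other crates unless LTO is enabled; with #[inline], the compiler may
// inline it at every call site.
#[inline]
pub fn f32_to_f16_bits(value: f32) -> u16 {
    f16::from_f32(value).to_bits()
}
```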

However, the benchmarks of exr are built with LTO, so the missing inline attribute shouldn't matter there - in fact, it looks like LTO was only needed to paper over this one missing #[inline]!

With this change to half, the benchmarks with and without LTO should come out the same, and you could enjoy good compile times AND good performance! Wait, no, that was something else - or at least, that PR didn't help here.

> Is the benchmark actually about loading an f32 image as f16? I expected it to measure a straightforward load.

Maybe we should rename this benchmark, as it seems to also convert the numbers :)

Are you interested in profiling f32 or in profiling f16?

There are files of both types in the repository, but most have a small number of pixels and therefore do not represent common real-world data. I don't know whether there is a large f16 file right now.

You are welcome to add the benchmarks that you need, for example a benchmark that loads f32 values from an f32 file :)

The conversion is something that will happen in the real world, so we want it to be fast. I think I remember there are intrinsics for converting between f32 and f16 - maybe they need to be activated with a flag in the half dependency?

Edit:

> use-intrinsics - Use hardware intrinsics for f16 and bf16 conversions if available on the compiler host target. By default, without this feature, conversions are done only in software, which will be the fallback if the host target does not have hardware support. Available only on Rust nightly channel.
>
> std - Enable features that depend on the Rust std library, including everything in the alloc feature. Enabling the std feature enables runtime CPU feature detection when the use-intrinsics feature is also enabled. Without this feature detection, intrinsics are only used when the compiler host target supports them.

Shouldn't it be possible for users to provide their own half dependency with that feature enabled?
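For reference, a downstream user could in principle try that via Cargo feature unification, roughly like this. This is only a hypothetical Cargo.toml sketch: the version numbers are placeholders and must be semver-compatible with the half version exrs itself depends on, and per the docs quoted above, use-intrinsics only works on nightly.

```toml
# Hypothetical downstream Cargo.toml sketch: enabling half's optional features
# here also enables them for the copy of half that exrs uses, as long as the
# version requirements unify to a single copy of the crate.
[dependencies]
exr = "1"
half = { version = "2", features = ["std", "use-intrinsics"] }
```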

Yes, I was surprised that a conversion is performed. It would be nice to rename the benchmark.

> I think I remember there are intrinsics for converting between f32 and f16 - maybe they need to be activated with a flag in the half dependency

There are two problems with this.

First, on x86 only very recent CPUs have native f16 arithmetic - according to Wikipedia, CPUs supporting it haven't even launched yet, but are coming in 11 days from now. (That's for arithmetic; conversion instructions have been available since 2009.) There are suitable intrinsics on ARM, but they're not supported by half or even the Rust standard library.

Second, those conversion intrinsics operate on a chunk of values - e.g. &[f32; 4] - while the exr crate calls the conversion on individual f32 values, so even if the intrinsics were available, exr wouldn't be able to make effective use of them.

On the exr side, the actionable part is switching from converting individual f32 values to converting chunks such as &[f32; 4]. But I don't think this will result in any immediate gains, at least not without further modifications in half, ideas for which I'll take to the half issue tracker.
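To make that concrete, here is a rough sketch of the shape I mean (a hypothetical helper, not current exrs code). The inner loop is still a per-element software fallback; the point is that a chunked layout is what an intrinsics- or SIMD-backed conversion in half would need in order to convert several values per instruction:

```rust
use half::f16;

// Hypothetical sketch: convert a whole run of samples at once instead of
// calling the conversion once per value. The inner loop below is still
// per-element, but the chunked shape is what a hardware-accelerated
// backend would need to convert e.g. 4 or 8 values at a time.
fn convert_samples_to_f16(src: &[f32], dst: &mut [f16]) {
    assert_eq!(src.len(), dst.len());
    for (src_chunk, dst_chunk) in src.chunks(4).zip(dst.chunks_mut(4)) {
        for (value, out) in src_chunk.iter().zip(dst_chunk.iter_mut()) {
            *out = f16::from_f32(*value);
        }
    }
}
```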

However, operating on images in the f16 pixel format is a terrible idea. CPUs do not implement f16 natively; even the ones with f16 intrinsics only ever operate on chunks of [f16; 8] to [f16; 32], never on individual values. half explicitly calls out in its documentation that CPUs do not implement f16 and that any math on these values is done by converting them to f32 and back, which is very slow (even on CPUs with intrinsics, because those operate only on chunks).
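Concretely, any scalar math on an f16 boils down to roughly this pattern (a sketch using half's to_f32/from_f32, not exrs code), which is why per-value f16 arithmetic is so slow:

```rust
use half::f16;

// What "f16 math" means on the CPU: widen to f32, do the actual arithmetic
// in f32, then narrow back. Both conversions are pure software unless the
// hardware intrinsics discussed above are available and enabled.
fn scale_sample(sample: f16, gain: f32) -> f16 {
    f16::from_f32(sample.to_f32() * gain)
}
```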

The only use case is shipping them to a GPU and displaying them there, since GPUs generally do support f16 natively. So the relevant benchmark should be explicitly marked as an exotic use case.

I was wrong about intrinsics not being available - they have been available on x86_64 CPUs since 2009.

This issue is getting hard to follow, I'll close this and open another one with the actionable takeaways.

haha alright, don't worry :) no big deal

key takeaway here is: please modify or add benchmarks as you see fit :)