dimforge / simba

Set of mathematical traits to facilitate the use of SIMD-based AoSoA (Array of Struct of Array) storage pattern.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Missed optimization opportunity in Simd::new

farnoy opened this issue · comments

I'm trying to initialize a Simd<f32x4> value from memory, but the safe way of doing it does a function call.

On simba 0.2.0, using the following rust code:

#[inline(never)]
pub fn exact_chonker_1(xs: &[f32]) -> simd::f32x4 {
    let mut acc = simd::f32x4::splat(1.0);
    let mut chunk_iter = xs.chunks_exact(4);

    for chunk in &mut chunk_iter {
        acc *= simd::f32x4::new(chunk[0], chunk[1], chunk[2], chunk[3]);
    }

    let mut r = simd::f32x4::splat(9999.0);
    let mut mask = simd::m32x4::splat(false);

    for (ix, item) in chunk_iter.remainder().iter().enumerate() {
        r.replace(ix, *item);
        mask.replace(ix, true);
    }

    acc * r.select(mask, simd::f32x4::splat(0.0))
}

#[inline(never)]
pub fn exact_chonker(xs: &[f32]) -> simd::f32x4 {
    let mut acc = simd::f32x4::splat(1.0);
    let mut chunk_iter = xs.chunks_exact(4);

    for chunk in &mut chunk_iter {
        acc *= simd::f32x4::from(*unsafe { std::mem::transmute::<*const f32, &[f32; 4]>(chunk.as_ptr()) });
    }

    let mut r = simd::f32x4::splat(9999.0);
    let mut mask = simd::m32x4::splat(false);

    for (ix, item) in chunk_iter.remainder().iter().enumerate() {
        r.replace(ix, *item);
        mask.replace(ix, true);
    }

    acc * r.select(mask, simd::f32x4::splat(0.0))
}

Analyzing the assembly, I'm seeing the initialization for f32x4::new() work like this:

 mov     r14, qword, ptr, [rip, +, _ZN5simba4simd16packed_simd_impl61Simd$LT$packed_simd..Simd$LT$$u5b$f32$u3b$$u20$4$u5d$$GT$$GT$3new17h20fdaa67639b0938E@GOTPCREL]
 xor     ebp, ebp
 lea     r12, [rsp, +, 64]
.LBB10_4:
 vmovaps xmmword, ptr, [rsp, +, 48], xmm3
 vmovd   xmm0, dword, ptr, [r13, +, 4*rbp]
 vmovd   xmm1, dword, ptr, [r13, +, 4*rbp, +, 4]
 vmovss  xmm2, dword, ptr, [r13, +, 4*rbp, +, 8]
 vmovss  xmm3, dword, ptr, [r13, +, 4*rbp, +, 12]
 mov     rdi, r12
 call    r14
 vmovaps xmm3, xmmword, ptr, [rsp, +, 48]

So the compiler seems to really want to call this Simd::new method, despite not using the results (xmm0-2 are not used as source operands later).

With the 2nd variant of the code, as far as I can tell the first loop just translates to:

.LBB11_4:
 vmulps  xmm0, xmm0, xmmword, ptr, [rsi, +, 4*rcx]
 add     rcx, 4
 cmp     rdx, rcx
 jne     .LBB11_4

Which is so much cleaner.

Is there anything you could think of that is inhibiting this for ::new()? Also, is there a better way to initialize from packed memory with safe rust?

Hi!

The fact the compilers really wants to call ::new() may be caused by the fact that ::new is not marked as #[inline] there. Adding the issing #[inline]may fix this.

Also, is there a better way to initialize from packed memory with safe rust?

We could add a .from_slice_unaligned() to use instead of ::new. And perhaps it would make sense to add some kind of .from_iterator(iter, default) which fills the lanes of the f32x4 with the iterator content, or default when the iterator does not yield enough values to fill all the lanes.

Thanks for the quick response!

I'll try the #[inline] idea soon. I'm new to these optimizations, do they only make sense for libraries where you want inlining across crates? Does the annotation also affect same-crate but different module function calls?

What would be the contract for from_slice_unaligned()? That sounds like reading from &[u8] where I'm interested in &[Self::Element], which would have to be aligned?

I'll try the #[inline] idea soon. I'm new to these optimizations, do they only make sense for libraries where you want inlining across crates? Does the annotation also affect same-crate but different module function calls?

The #[inline] is mostly for inline across crates, yes. I think it can also affect inlining within the same crate when the compiler uses multiple codegen units.

What would be the contract for from_slice_unaligned()? That sounds like reading from &[u8] where I'm interested in &[Self::Element], which would have to be aligned?

That would be reading from &[Self::Element]. It would basically just call the corresponding from_slice_unaligned function from packed_simd. It is unaligned because we have no guarantee that &slice[0] is aligned to, e.g., a 16-byte boundary (i.e. align_of::<f32x4>()).

Inlining helps, nice catch!

That would be reading from &[Self::Element]. It would basically just call the corresponding from_slice_unaligned function from packed_simd. It is unaligned because we have no guarantee that &slice[0] is aligned to, e.g., a 16-byte boundary (i.e. align_of::()).

Got it, I forgot that SIMD loads have stricter alignment requirements. In my experiments, rustc usually outputted the aligned versions of instructions, but I guess it's seen how the allocation is done to begin with.

All of this sounds good to me. I don't love the fact that these functions can panic. Will send a PR shortly