dfockler / faster

SIMD for humans

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

faster

SIMD for Humans

Easy, powerful, absurdly fast numerical calculations. Chaining, Type punning, static dispatch (w/ inlining) based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.

It looks something like this:

let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .map(|v| { f32s::splat(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() -
               f32s::splat(4.0) - f32s::splat(2.0) })
    .scalar_collect::<Vec<f32>>();

Which is analogous to this scalar code:

let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| { 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() -
               4.0 - 2.0 })
    .collect::<Vec<f32>>();

The vector size is entirely determined by the machine you’re compiling for - it attempts to use the largest vector size supported by your machine, and works on any platform or architecture (see below for details).

Compare this to traditional explicit SIMD:

use std::mem::transmute;
use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
    for ch in init.chunks_mut(4) {
        let v = f32x4::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x4::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm_and_ps(v, abs_mask) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_rsqrt_ps(v) };
        v = unsafe { _mm_ceil_ps(v) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_mul_ps(v, 9.0) };
        v = unsafe { _mm_sub_ps(v, 4.0) };
        v = unsafe { _mm_sub_ps(v, 2.0) };
        f32x4::store(ch, 0);
    }
} else if cfg!(all(not(target_feature = "avx512"), target_feature = "avx")) {
    for ch in init.chunks_mut(8) {
        let v = f32x8::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x8::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm256_and_ps(v, abs_mask) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_rsqrt_ps(v) };
        v = unsafe { _mm256_ceil_ps(v) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_mul_ps(v, 9.0) };
        v = unsafe { _mm256_sub_ps(v, 4.0) };
        v = unsafe { _mm256_sub_ps(v, 2.0) };
        f32x8::store(ch, 0);
    }
}

Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX - and you have to look up each intrinsic to ensure it’s usable for your compilation target.

Upcoming Features

More intrinsic traits are coming; feel free to open an issue or pull request if you have one you’d like to see.

Swizzling, automated testing, and documentation are also in the pipeline.

Compatibility

Faster currently supports 32- and 64-bit x86 machines with SSE and above, although AVX-512 support isn’t working in rustc yet. Support for non-x86 architectures is currently blocked by stdsimd and rustc.

Of course, once those issues are resolved, adding support for ARM, MIPS, or any other intrinsics and vector lengths will be trivial.

Performance

Here are some extremely unscientific benchmarks which, at least, prove that this isn’t any worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD.

$ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge
test tests::bench_map_scalar         ... bench:         890 ns/iter (+/- 5)
test tests::bench_map_simd           ... bench:         124 ns/iter (+/- 0)
test tests::bench_map_uneven_simd    ... bench:         122 ns/iter (+/- 0)
test tests::bench_nop_scalar         ... bench:          24 ns/iter (+/- 0)
test tests::bench_nop_simd           ... bench:          24 ns/iter (+/- 0)
test tests::bench_reduce_scalar      ... bench:         860 ns/iter (+/- 2)
test tests::bench_reduce_simd        ... bench:         102 ns/iter (+/- 0)
test tests::bench_reduce_uneven_simd ... bench:          99 ns/iter (+/- 0)

$ RUSTFLAGS="-C target-cpu=pentium3" cargo bench # host is ivybridge
test tests::bench_map_scalar         ... bench:         896 ns/iter (+/- 2)
test tests::bench_map_simd           ... bench:         247 ns/iter (+/- 1)
test tests::bench_map_uneven_simd    ... bench:         240 ns/iter (+/- 0)
test tests::bench_nop_scalar         ... bench:          16 ns/iter (+/- 0)
test tests::bench_nop_simd           ... bench:          18 ns/iter (+/- 0)
test tests::bench_reduce_scalar      ... bench:         868 ns/iter (+/- 2)
test tests::bench_reduce_simd        ... bench:         235 ns/iter (+/- 0)
test tests::bench_reduce_uneven_simd ... bench:         222 ns/iter (+/- 0)

About

SIMD for humans

License:Mozilla Public License 2.0


Languages

Language:Rust 100.0%