std::arch SIMD intrinsics

Question

gnzlbg opened this issue 6 years ago

Currently the SIMD intrinsics are implemented in stdsimd using link_llvm_intrinsics to directly call the LLVM intrinsics via their C ABI, and using a handful of "generic" SIMD intrinsics.

Is there a way to directly call Cretonne intrinsics?
Is there a cfg() macro available to detect whether the codegen backend is LLVM or Cranelift?

bjorn3 · Answer 1 · Wed Nov 21 2018 15:29:07 GMT+0800 (China Standard Time)

Is there a way to directly call Cretonne intrinsics?

Cranelift doesn't have intrinsics. It does have SIMD types, which can be used with normal instructions like iadd. However, I don't think all SIMD intrinsics have a Cranelift instruction counterpart, and implementing all of them takes time I would rather spend on other things.

Is there a cfg() macro available to detect whether the codegen backend is LLVM or Cranelift?

Not yet, have been thinking about adding one though.

gnzlbg · Answer 2 · Wed Nov 21 2018 15:59:53 GMT+0800 (China Standard Time)

@bjorn3 check this out:

https://github.com/rust-lang-nursery/stdsimd/blob/master/coresimd/x86/sse.rs#L1986

This is how std::arch calls into the different SIMD instructions (not all of them, but many of them). The question is, would it be possible to write similar code to target Cranelift? Or should we move all of these into platform-intrinsics and add the abstraction layer at the Rust codegen level?

It does have SIMD types, which can be used with normal instructions like iadd

That sounds more like what packed_simd does, which uses some of rustc's generic simd intrinsics, e.g., see here: https://github.com/rust-lang-nursery/packed_simd/blob/master/src/codegen/llvm.rs

It might be easier to get packed_simd to work with Cranelift than to get std::arch working, but note that packed_simd also uses std::arch intrinsics in many cases to work around codegen bugs in LLVM. A cfg macro to detect the backend would be needed here to detect Cranelift and remove these workarounds, but we might have to add other workarounds for Cranelift, by "somehow" calling Cranelift SIMD operations.

Even if this work is not there yet, I think work to "prepare" std::arch and packed_simd for Cranelift can already start. Ideally, using such a cfg macro, we would get one single std::arch intrinsic working first, write down the process to "support Cranelift", and try to get people involved, mentor them, etc.

Removing link_llvm_intrinsics from std::arch would be a bit of work, but it is possible. And there are python generators in Rust upstream that generate platform-intrinsics for these "automatically". Maybe porting those to Cranelift would be a way forward. We can then use cfg macros in std::arch to only expose the simd instructions that Cranelift supports, and can work on adding support to Cranelift for more simd instructions until we reach parity.

bjorn3 · Answer 3 · Wed Nov 21 2018 20:54:56 GMT+0800 (China Standard Time)

cg_clif currently puts non-primitives in stack slots, which would kill SIMD performance.

bjorn3 · Answer 4 · Wed Nov 21 2018 20:57:01 GMT+0800 (China Standard Time)

The question is, would it be possible to write similar code to target Cranelift?

Not without changes to cg_clif to intercept those intrinsics.

Even if this work is not there yet, I think work to "prepare" std::arch and packed_simd for Cranelift can already start. Ideally, using such a cfg macro, we would get one single std::arch intrinsic working first, write down the process to "support Cranelift", and try to get people involved, mentor them, etc.

+1

gnzlbg · Answer 5 · Wed Nov 21 2018 21:03:44 GMT+0800 (China Standard Time)

To start preparing stdsimd and packed_simd, it would help to clarify what the result should look like in those crates. cc @eddyb @sunfishcode

@eddyb was of the strong opinion that stdsimd should stop using link_llvm_intrinsics and start using platform intrinsics instead, but not all platform-intrinsics might be available for cranelift at least initially, so having some #[cfg(rustc_backend_llvm)] and #[cfg(rustc_backend_cranelift)] macros behind a feature gate in rustc would be useful for that, and also, to gate the llvm-specific workarounds of packed_simd.
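A minimal sketch of what such backend gating might look like. The cfg names `rustc_backend_llvm` and `rustc_backend_cranelift` are the hypothetical ones proposed above, not cfgs that rustc is known to set, and `backend_name` is an illustrative helper. Built without extra flags, neither cfg is active, so the last branch is taken.

```rust
// Hypothetical cfg names taken from the proposal above; rustc does not
// define them, so the lint for unknown cfgs is silenced here.
#[allow(unexpected_cfgs)]
pub fn backend_name() -> &'static str {
    if cfg!(rustc_backend_cranelift) {
        // Cranelift-specific code (or the absence of LLVM workarounds) would go here.
        "cranelift"
    } else if cfg!(rustc_backend_llvm) {
        // LLVM-specific workarounds would be gated here.
        "llvm"
    } else {
        // Neither cfg is set: the backend cannot be identified.
        "unknown"
    }
}
```

Passing `--cfg rustc_backend_cranelift` to rustc by hand (for example through RUSTFLAGS) would select the first branch, which could be enough to experiment before any compiler support exists.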

The lowest-hanging fruit is probably to get packed_simd to work with cranelift, since many crates using std::arch also offer a packed_simd version behind a cargo feature to make their crate portable (e.g. rand). If we put all llvm workarounds behind feature gates in packed_simd, we would "only" have to implement the ~20 generic SIMD intrinsics for the cranelift backend.

Eduard-Mihai Burtescu · Answer 6 · Wed Nov 21 2018 21:50:20 GMT+0800 (China Standard Time)

I don't think littering stdsimd with #[cfg]s is a good idea - the proper solution is to allow mixing LLVM and Cranelift codegen units like @sunfishcode suggested, and long-term have something like platform-intrinsics built into Cranelift, used as the source of truth (many of the instructions can be dealt with in a compact declarative fashion), with a mapping to LLVM names as a secondary interface.

bjorn3 · Answer 7 · Wed Nov 21 2018 21:50:38 GMT+0800 (China Standard Time)

Using platform-intrinsics to get packed_simd working seems doable. The simd_reduce_* family doesn't seem to have cranelift counterparts though. (cc @sunfishcode)

bjorn3 · Answer 8 · Wed Nov 21 2018 21:55:16 GMT+0800 (China Standard Time)

the proper solution is to allow mixing LLVM and Cranelift codegen units like @sunfishcode suggested

I like the idea. There is currently an ABI incompatibility between cg_clif and cg_llvm: cg_clif always passes non-primitives by reference and uses the Cranelift fast calling convention instead of System V like cg_llvm. Other than that, it kind of works already today (metadata is put in the same place and symbol names are made the same way).

gnzlbg · Answer 9 · Wed Nov 21 2018 22:27:59 GMT+0800 (China Standard Time)

The simd_reduce_* family doesn't seem to have cranelift counterparts though.

I am not sure whether these are necessary for an MVP of packed_simd working with Cranelift. Maybe we could work around these in Cranelift, at least initially (e.g. by falling back to scalar code).
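As a rough illustration of the scalar fallback idea, a horizontal reduction can be written as a plain loop over the lanes. The function names and the fixed four-lane i32 shape below are assumptions made for the example; they are not the signatures of rustc's simd_reduce_* intrinsics.

```rust
/// Scalar stand-in for a horizontal add over four i32 lanes.
/// Wrapping arithmetic is used so overflow does not panic.
pub fn reduce_add_scalar(lanes: [i32; 4]) -> i32 {
    lanes.iter().fold(0i32, |acc, &lane| acc.wrapping_add(lane))
}

/// Scalar stand-in for a horizontal max over four i32 lanes.
pub fn reduce_max_scalar(lanes: [i32; 4]) -> i32 {
    lanes.iter().copied().fold(i32::MIN, i32::max)
}
```

A backend could lower a reduction to something of this shape first and swap in real vector instructions later without changing the observable result.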

Eduard-Mihai Burtescu · Answer 10 · Wed Nov 21 2018 22:34:03 GMT+0800 (China Standard Time)

Yeah, I expect any upstreamed backend to use e.g. FnType and strictly adhere to the ABI.
We can even come up with a Cranelift-friendly ABI that we use for LLVM too - we just have to be consistent and do everything through FnType.

bjorn3 · Answer 11 · Wed Nov 21 2018 22:39:11 GMT+0800 (China Standard Time)

I don't understand what to do when getting PassMode::Cast.

Eduard-Mihai Burtescu · Answer 12 · Wed Nov 21 2018 22:43:37 GMT+0800 (China Standard Time)

You're supposed to pass the type's bytes as one or more immediates, as indicated by the information in the cast (e.g. rustc_target::abi::call::Reg tells you the register kind and size).
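To make the idea concrete, here is a simplified sketch of splitting a value's bytes into register-sized pieces. `bytes_to_reg_values` is a hypothetical helper written for this example; it does not use rustc's actual data structures, assumes little-endian integer registers of at most 8 bytes, and zero-pads the last piece.

```rust
/// Hypothetical helper: split the raw bytes of a value into chunks of
/// `reg_size` bytes and widen each chunk to a u64 (little-endian).
pub fn bytes_to_reg_values(bytes: &[u8], reg_size: usize) -> Vec<u64> {
    assert!((1..=8).contains(&reg_size), "register size must be 1..=8 bytes");
    bytes
        .chunks(reg_size)
        .map(|chunk| {
            // Zero-pad a short trailing chunk up to 8 bytes.
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}
```

For a 12-byte struct of three u32 fields and 8-byte registers, this yields two values: one holding the first two fields and one holding the third.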

bjorn3 · Answer 13 · Wed Nov 21 2018 22:45:04 GMT+0800 (China Standard Time)

Got it. Thanks!

bjorn3 · Answer 14 · Sat Jul 27 2019 23:55:41 GMT+0800 (China Standard Time)

I implemented support for some simd_* intrinsics on the simd_emulation branch.
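A minimal sketch of what lane-wise scalar emulation means in general; it is not the code on that branch. `emulate_lanewise` is a hypothetical helper that models a vector as a fixed-size array and applies a scalar operation to each pair of lanes.

```rust
/// Hypothetical helper: emulate a binary SIMD operation by applying
/// `op` to each pair of corresponding lanes, one lane at a time.
pub fn emulate_lanewise<T: Copy, const N: usize>(
    a: [T; N],
    b: [T; N],
    op: impl Fn(T, T) -> T,
) -> [T; N] {
    std::array::from_fn(|i| op(a[i], b[i]))
}
```

Calling `emulate_lanewise([1, 2, 3, 4], [10, 20, 30, 40], |x: i32, y: i32| x.wrapping_add(y))` adds the two vectors one lane at a time. The result matches a real vector add, only without the speed.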

gnzlbg · Answer 15 · Sun Jul 28 2019 00:13:14 GMT+0800 (China Standard Time)

I suppose that doing a scalar emulation might be initially ok, but doesn't cranelift support emitting the appropriate instructions?

bjorn3 · Answer 16 · Sun Jul 28 2019 01:18:41 GMT+0800 (China Standard Time)

Yes, it does for most intrinsics, but Cranelift is currently implementing real SIMD, instead of emulation like I did here. (bytecodealliance/cranelift#833, bytecodealliance/cranelift#855, bytecodealliance/cranelift#868) Because of this I think, for example, adding two vectors is broken. (cc @abrown, am I correct?) Also, using real SIMD will not give much performance benefit until the rest of cg_clif is changed to not always store vectors on the stack and the std::arch functions are inlined.

Andrew Brown · Answer 17 · Tue Jul 30 2019 00:19:02 GMT+0800 (China Standard Time)

Because of this I think for example adding two vectors is broken

I think this is only broken if you turn on the enable_simd setting because only a handful of SIMD instructions are implemented. Otherwise, the vectors are split up and--as far as I understand cranelift--the vector addition should work (see https://github.com/CraneStation/cranelift/blob/a5b17e8a0f044ec03b6982baea3757af43e70b7b/cranelift-codegen/src/isa/x86/abi.rs#L87-L95). More SIMD instructions are coming but we need to review and merge foundational stuff like bytecodealliance/cranelift#855 and bytecodealliance/cranelift#868 --any help is appreciated 😄.

bjorn3 · Answer 18 · Tue Jul 30 2019 00:27:43 GMT+0800 (China Standard Time)

I think this is only broken if you turn on the enable_simd setting

Of course, I forgot about that setting.

bjorn3 · Answer 19 · Tue Jul 30 2019 20:57:08 GMT+0800 (China Standard Time)

Opened #650.

bjorn3 · Answer 20 · Mon Jul 24 2023 21:30:41 GMT+0800 (China Standard Time)

#1378 and #1380 implemented a bunch more intrinsics.

bjorn3 · Answer 21 · Sat Oct 14 2023 21:55:40 GMT+0800 (China Standard Time)

Enough intrinsics are supported now that it seems like removing the hack to make is_x86_feature_detected!() return false in #1397 didn't cause issues for anyone. Or at least nobody reported an issue because of this change.
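For context, this is the kind of user code that depends on the macro's result: a runtime check that picks a std::arch path when the feature is reported and a scalar path otherwise. The function names are made up for this sketch; `is_x86_feature_detected!`, `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` are the standard library items.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds two 4-lane f32 vectors, using SSE when the CPU reports it.
pub fn add_f32x4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse") {
            // Safe to call: the required feature was detected at runtime.
            return unsafe { add_f32x4_sse(a, b) };
        }
    }
    add_f32x4_scalar(a, b)
}

/// Portable fallback, one lane at a time.
fn add_f32x4_scalar(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn add_f32x4_sse(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // Unaligned loads, because arrays of f32 are only 4-byte aligned.
    let va = _mm_loadu_ps(a.as_ptr());
    let vb = _mm_loadu_ps(b.as_ptr());
    let mut out = [0.0f32; 4];
    _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
    out
}
```

While the macro was forced to return false, code like this always took the scalar path. With the hack removed, the intrinsic path runs and the backend has to support those intrinsics.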

bjorn3 · Answer 22 · Tue Oct 24 2023 20:32:46 GMT+0800 (China Standard Time)

Rav1e and image now work thanks to e5ba1e8 and a558968 respectively.