rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

Home Page: https://docs.rs/ndarray/

Consider marking split_at as #[inline]

HadrienG2 opened this issue

In recursive domain decomposition (the easiest, if not most performant, way to parallelize iteration via rayon::iter::split() while optimizing for locality at every cache level), split_at can easily become a bottleneck. Letting it be inlined into the caller greatly reduces its performance impact.


Conventional wisdom says that since it's already generic, it's already possible for the compiler to inline it. But that depends a lot on the optimization mode. If it's just generic, is that no longer enough?

Seems so; at least I'm observing a case where ArrayViewMut::split_at is not inlined as of Rust 1.70.


It sounds like you use rayon::iter::split. Take note that ndarray already implements split-based parallelism natively using rayon; in that case see Zip and the parallel module in the docs. 🙂

> Conventional wisdom says that since it's already generic, it's already possible for the compiler to inline it. But that depends a lot on the optimization mode. If it's just generic, is that no longer enough?

Being generic implies being available for cross-crate inlining by necessity, which matches one effect of putting #[inline] on the function. But it does not mean that the inlining heuristics are any more likely to actually inline the function, which is where #[inline] differs. (Of course, generic functions are often indirectly more likely to be inlined, e.g. when they take a closure and hence by necessity are instantiated and called only once for that particular closure.)

> It sounds like you use rayon::iter::split. Take note that ndarray already implements split-based parallelism natively using rayon; in that case see Zip and the parallel module in the docs. 🙂

I am doing a stencil computation (every output point is a weighted sum of a window of input points) and got the impression that this particular computation pattern is not supported by ndarray's built-in parallelism facilities.


Here are some links with info (they were useful to me and could be useful to others):

#[inline] and codegen-units interact in a significant way. Marking a function #[inline] can affect compilation a lot more under the codegen-units model, and it can slow down compile times for the same reason. Since there has been steady development in this area, I'm not sure how well all of this is understood(?). Adjusting release settings (switching to a single codegen unit, enabling thin LTO, etc.) is something a user can do without changing the library.

With that said, adding #[inline] often seems like a good idea, but we should be reluctant to use it everywhere (any updates on this from ecosystem wisdom?)

Indeed, I regularly set codegen-units = 1 in my performance-sensitive projects in order to work around lost inlining of important std methods like Iterator::fold(), or just to make the performance of microbenchmarks more reproducible. With multiple codegen units, the "cut" between codegen units moves around depending on the number of host cores or trivial code changes, so methods flip between inlined and non-inlined states, and that makes optimization work harder.
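For reference, the workaround described above is a profile setting in Cargo.toml (the values shown are just the ones mentioned in this thread):

```toml
[profile.release]
codegen-units = 1   # one CGU: no cross-CGU inlining losses, but slower builds
# or, alternatively:
# lto = "thin"      # cheaper than fat LTO, but with weaker inlining results
```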

That being said, setting codegen-units = 1 does come at the cost of a major degradation in compilation time (as does fat LTO, which is often necessary because in my experience the inlining heuristics of thin LTO are worse). Therefore, when I think that my use case for an #[inline] directive is not so narrow as to only concern myself, I open an upstream issue to request it, as I did here.

In my opinion, caller-side inline directives as recently introduced by clang would often be better, but alas we don't have those in Rust. I opened an internals thread to see if there would be interest in them.

Putting on my library maintainer hat, I will usually add #[inline] directives to public methods of crates that I maintain, and to private methods that are called by public #[inline] methods, if I have reasons to believe that lack of inlining could be a major performance liability for the caller in reasonably common use cases.

> Being generic implies being available for cross-crate inlining by necessity

Out of curiosity, @HadrienG2, are you using any sort of -Zshare-generics style flags? Because as we make it possible to (re)use generic monomorphizations between CGUs, the quoted statement becomes less true, so perhaps the conventional wisdom should be revisited (and/or the CGU partitioning scheme, if unique monomorphizations are being split out from their unique caller's CGU). I don't think we've enabled cross-crate generics sharing yet, but we might've enabled it for single-crate cross-CGU; I'm not sure one way or the other.

An example program where the (non)presence of #[inline] on the generic function (applied via source patching to change just that and nothing else) makes a meaningful performance impact would do a lot of good for showing that it is the (non)presence of #[inline] causing the performance cliff you're observing. Obviously, since it's CGU-partitioning dependent, you won't really be able to minimize an example much, but a full project example would still be helpful, if it's available to be shared.

(Further general discussion is in the irlo thread; this is what's immediately relevant here.)

> I will usually add #[inline] directives to public methods of crates that I maintain [...] if I have reasons to believe that lack of inlining could be a major performance liability for the caller in reasonably common use cases.

General wisdom has in fact been to let the inlining heuristics do their job by default, so if a function is generic (thus, at least historically, already necessarily monomorphized into each CGU that uses it and available to inlining) then it shouldn't also be marked #[inline], so that the standard inlining heuristics will still apply.

Developers have historically been really bad at intuiting good inlining heuristics, to the point that C/C++ compilers typically ignore the inline keyword for the purpose of inlining heuristics (apocryphally, at least; I just did a quick search and this might not be as true as I previously thought). Your heuristic is probably reasonable for such "weak" inlining (codegen this separately into each CGU and make available to standard inlining heuristics), but is probably overeager for heuristic-adjusting inlining hints (telling the optimizer to consider the function as a better candidate for inlining than it otherwise would; my quick search suggested LLVM cuts the inlining badness score by a factor of 8 for functions annotated with inlinehint).

The inlining heuristics are pretty good at their job. We should prefer not to adjust them until observing evidence that they aren't sufficient for a specific case, and to not adjust them ahead of time off of "inlining is probably outsizedly beneficial here" vibes. In the typical case, the optimizer knows more accurately than you when stuff should be inlined. (Not always! Which is why the hints exist. But usually.)

@CAD97 Since you also asked your question about compiler flags in the irlo thread, I replied to it there. The short answer is that no, I am not using -Zshare-generics or similar, just the standard rustc v1.70 stable configuration.

The code is public, I will extract some reproducers of pathological behavior once I have something better than a cellphone at hand. I guess I should rather post these on irlo than here, since they are probably not of particular interest to an ndarray audience.

Regarding the wisdom of applying #[inline] directives eagerly before benchmarking, I started from the viewpoint that you describe (let the optimizer do its job), but have gradually shifted away from it and towards more proactive inlining over time, due to overly frequent inlining issues in my Rust projects.

If rustc wants me to trust it, it will need to get good enough to earn that trust. Right now it is unfortunately not quite there in the presence of multiple crates and multiple codegen units, the latter being both the default configuration (i.e. what my users will see unless they or I do something to change it) and highly preferable for compile times (i.e. not something I would like to change myself if I can avoid it).


There's a bunch of method calls used between ArrayViewMut::split_at and RawViewMut::split_at that look like small methods that should be #[inline].
However, the main logic is implemented in RawArrayView::split_at, which is larger than my gut-feeling threshold for inlining, so I'm not sure if it should be marked. Have you tried it; do you know if it actually helps? I think I could be brought on board with that change; there is a lot of bounds checking that could probably be removed after inlining.

> In my opinion, caller-side inline directives

It's not really the same thing, but I think it's fun and I want to share the information: we have definition-side inline directives on closures, and that almost looks like caller-side. I've experimented with using those instead of macro-based unrolling (in matrixmultiply). In use it looks like foo(#[inline(always)] |x| x + 1), and it has an effect. The cumulative effect of foo(#[inline(always)] |x| f(x)) is not something I had considered before, but since inlining is bottom-up it could maybe work(?) to force the call to f to be inlined.

> There's a bunch of method calls used between ArrayViewMut::split_at and RawViewMut::split_at that look like small methods that should be #[inline].
> However, the main logic is implemented in RawArrayView::split_at, which is larger than my gut-feeling threshold for inlining, so I'm not sure if it should be marked. Have you tried it; do you know if it actually helps? I think I could be brought on board with that change; there is a lot of bounds checking that could probably be removed after inlining.

In my use case, it was specifically ndarray::impl_views::splitting::<impl ndarray::ArrayBase<ndarray::ViewRepr<&mut A>,D>>::split_at aka ArrayViewMut::split_at that failed to inline, most likely because it was the one that I called across a crate boundary. The inlining of lower-level functions transitively called by that function worked just fine.

> It's not really the same thing, but I think it's fun and I want to share the information: we have definition-side inline directives on closures, and that almost looks like caller-side. I've experimented with using those instead of macro-based unrolling (in matrixmultiply). In use it looks like foo(#[inline(always)] |x| x + 1), and it has an effect. The cumulative effect of foo(#[inline(always)] |x| f(x)) is not something I had considered before, but since inlining is bottom-up it could maybe work(?) to force the call to f to be inlined.

As far as I understand inlining directives, this would only force rustc to inline the closure, and not the other functions transitively called by the closure. But maybe @CAD97 can comment on this.