jonhoo / ordsearch

A Rust data structure for efficient lower-bound lookups


ordsearch still slower than sorted_vec on AMD

jonhoo opened this issue · comments

I have an AMD Ryzen™ 9 7950X3D, and even with all of the latest improvements (thanks @bazhenov!), I'm still seeing ordsearch slower than sorted_vec (and often btreeset) in the benchmarks. I've uploaded my entire Criterion report, and the results appear consistent across multiple input sizes and the different benchmarks. Would be curious to hear exactly which benchmarks (and setups) folks are seeing ordsearch being faster for!
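For context, the "lower-bound" lookup that all three contenders implement (smallest element greater than or equal to the query) can be sketched with the standard library alone. This is illustrative only; `sorted_vec` and `ordsearch` expose their own APIs, and the benchmark code differs from this:

```rust
use std::collections::BTreeSet;

// Lower bound on a sorted Vec: partition_point returns the index of the
// first element for which the predicate is false, i.e. the first e >= x.
fn lower_bound_vec(v: &[u32], x: u32) -> Option<u32> {
    let i = v.partition_point(|&e| e < x);
    v.get(i).copied()
}

// Lower bound on a BTreeSet: the first element of the range x.. .
fn lower_bound_btree(s: &BTreeSet<u32>, x: u32) -> Option<u32> {
    s.range(x..).next().copied()
}

fn main() {
    let v = vec![1u32, 3, 5];
    let s: BTreeSet<u32> = v.iter().copied().collect();
    println!("{:?} {:?}", lower_bound_vec(&v, 4), lower_bound_btree(&s, 4));
}
```
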

Also, @Qqwy, one downside of the new criterion setup is that it's a little harder (I think) to see the overall trends, mostly because the benchmark results for different type sizes (u8, u16, etc.) are now in different folders. Could we somehow make the type sizes also be considered an "input" (with discrete steps) rather than completely independent benchmarks?

I ran new benchmarks and got results similar to yours. It's somewhat strange but promising. At least it looks like the results are consistent between AMD and Intel chips. I will look into this in a few days, when I have more time.

Ok, I think I know what is going on.

The problem is here:

b.iter(|| {
    r = r.wrapping_mul(1664525).wrapping_add(1013904223);
    let r = std::cmp::min(r, MAX);
    let x = black_box(T::try_from(r % size).unwrap());
    let _res = black_box(search_fun(&c, x));
})

As I mentioned in #4, saturating computations usually produce unfair benchmarks. In this particular case, min(r, MAX), combined with the fact that we are running the LCG (r) on usize, leads to the unfortunate result that most of the generated queries are exactly equal to MAX. This makes the queries very predictable, which is not a fair comparison in the general case.
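The effect is easy to reproduce. A quick sketch (seed and MAX values assumed here, not taken from the benchmark) that runs the same LCG step on a 64-bit usize and counts how often min(r, MAX) collapses to MAX:

```rust
// Sketch: the LCG output is roughly uniform over the full usize range,
// so for any small MAX, min(r, MAX) == MAX almost every iteration.
fn fraction_clamped(iters: usize, max: usize) -> f64 {
    let mut r: usize = 0; // assumed seed, for illustration
    let mut hits = 0usize;
    for _ in 0..iters {
        // Same LCG step as the benchmark loop.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        if std::cmp::min(r, max) == max {
            hits += 1;
        }
    }
    hits as f64 / iters as f64
}

fn main() {
    // u16::MAX stands in for the benchmark's MAX.
    println!("{:.4}", fraction_clamped(1_000_000, u16::MAX as usize));
}
```

On a 64-bit target essentially every query clamps to the same value, which the branch predictor and cache then handle perfectly.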

I think we need to change the following

let r = std::cmp::min(r, MAX);
let x = T::try_from(r % size).unwrap();

to:

let max = MAX.min(size);
let x = T::try_from(r % max).unwrap();
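With that change the generated queries actually vary. A quick sketch (seed and parameter values assumed) to check the spread:

```rust
use std::collections::HashSet;

// Sketch of the proposed fix: r % max with max = MAX.min(size) spreads the
// queries across the key range instead of collapsing them onto one value.
fn gen_queries(iters: usize, max_cap: usize, size: usize) -> Vec<usize> {
    let max = max_cap.min(size);
    let mut r: usize = 0; // assumed seed, for illustration
    (0..iters)
        .map(|_| {
            r = r.wrapping_mul(1664525).wrapping_add(1013904223);
            r % max
        })
        .collect()
}

fn main() {
    let qs = gen_queries(1_000, u16::MAX as usize, 4_096);
    let distinct: HashSet<_> = qs.iter().collect();
    println!("{} distinct queries out of {}", distinct.len(), qs.len());
}
```
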

Here are results for Search u32/...

[benchmark results graph]

Those results make sense to me: on extremely small collections ordsearch is slower because of the overhead from memory prefetch. Otherwise ordsearch is still faster.

@Qqwy, what do you think?

@bazhenov Thank you very much for looking into this. Your conclusions make a lot of sense to me and I am definitely for making the proposed change 👍 .

I'll re-run the benchmark (before the change) on my M1 chip as well so we have data from the aarch64 architecture too.

Could we somehow make the type sizes also be considered an "input" (with discrete steps) rather than completely independent benchmarks?

I don't think it is possible to have multiple dimensions as input.

One possibility is to consider 'btreeset (u16)', 'btreeset (u32)', 'ordsearch (u16)' etc. as separate 'implementations' of the same benchmark. This would mean that we get one graph with more lines. Maybe that is useful?
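As a rough sketch of that idea (Criterion API; the label strings, sizes, and the trivial search body here are made up for illustration), each "implementation (type)" pair becomes the function name within one benchmark group, with the collection size as the input, so everything lands on a single plot:

```rust
// Hypothetical sketch: one benchmark group, "impl (type)" as the line label,
// collection size as the x-axis input.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_search(c: &mut Criterion) {
    let mut group = c.benchmark_group("search");
    for size in [8usize, 64, 512, 4096] {
        group.bench_with_input(
            BenchmarkId::new("sorted_vec (u32)", size),
            &size,
            |b, &s| {
                let v: Vec<u32> = (0..s as u32).collect();
                b.iter(|| v.binary_search(&(s as u32 / 2)));
            },
        );
        // ...repeat with "btreeset (u32)", "ordsearch (u16)", etc.
    }
    group.finish();
}

criterion_group!(benches, bench_search);
criterion_main!(benches);
```
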

Another approach could be to look at the raw JSON or CSV export of the benchmark run and create our own graph(s) from that. More flexible, but more work to set up.

I think one graph with more lines is probably fine 👍 Ideally we'd be able to do something like use color to indicate "type" and line type to indicate "size", but that may require custom plotting logic.