jonhoo / inferno

A Rust port of FlameGraph


Optimize implementation for when there are many frames

jonhoo opened this issue · comments

#99 identified that performance drops significantly when there are many frames in the output (see #99 (comment)). We should figure out why that is, and fix it! Maybe it's time to flamegraph inferno ;)

As a start to this, I profiled and generated a flame graph of running inferno-flamegraph with --reverse to create many frames. It looks like a lot of time is spent in svg::write_str and filled_rectangle. Both of these spend most of the time adding attributes. I think implementing #50 will reduce the number of attributes written?

Very little time is spent in merge::frames. It's really just all of the attribute creation and formatting/writing the SVG tags that's taking most of the time.


flamegraph.svg.zip

Hmm, that push_attribute stack looks suspicious. Specifically, it seems to spend a bunch of time reserving additional memory in Vec::extend, which I thought we got rid of in #58 (which fixed #41). @JordiChauzi, any idea why that might be happening?

There's also a decent amount of time spent in number formatting there. I wonder whether we can use dtoa to significantly speed that up? Or alternatively if @bcmyers' num-format crate now has support for floating point, that may be able to take that job?

The buffered writes to output we can't do much about, though you are right that #50 should help a bit with that, since there'll just be less stuff to write overall.

svg::write_str wasn't reusing a cached svg::Event like everything else. I submitted #108 to fix this.

So I guess now that #108 and #110 have landed, we're probably down to number formatting being the biggest one?

Yes, I think so, but I'm not sure what we should do to improve it. Even though dtoa would be faster, I don't think it will allow us to format the way we need it (with at most two fractional digits).

Maybe @dtolnay or @bcmyers can give us a hint here? ^

If the bottleneck is formatting floats with at most two digits after the decimal point, you will want to fork one or both of dtoa or ryu and specialize it for that representation. I don't know which one would turn out faster for this use case.

Is the input f32 or f64? Can you provide more detail about the exact output format you need? Never exponential notation?

The ryu code is more approachable -- there is an incomprehensible piece that converts a float into mantissa and base-10 exponent which you can pretty much use unchanged, and a comprehensible piece that formats to text, of which we already have two alternative representations for you to base on.

The dtoa code would be less approachable but potentially faster if you find the right place to stick an early return when the necessary precision is reached. With ryu it is less feasible to avoid generating full precision in the mantissa.

@dtolnay all the inputs are f64.

I did some digging to provide some more detail, and here are the places we do float formatting in a tight loop:

let x = write!(buf, "{:.2}", item.x);
let y = write!(buf, "{:.2}", item.y);

"{} ({} {}, {:.2}%)",

There's also this, though I'm a little less worried about this one since it's only used for differentials. We could also probably work around the :+ in a different way.

"{} ({} {}, {:.2}%; {:+.2}%)",

So the format output is actually pretty straightforward. The numbers are also all fairly small, so I don't think we'll ever need exponential notation. Just returning at the right precision in something like dtoa seems like a great way to go about it. I guess it means we'll probably first do a write of the float into the StrStack, and then write that string into the final output.

We actually also spend a bit of time formatting integers in

let x = write!(buffer, "{}", rect.x1);
let y = write!(buffer, "{}", rect.y1);
let width = write!(buffer, "{}", rect.width());
let height = write!(buffer, "{}", rect.height());

But that's probably not worth doing anything about for the time being.

It does seem like dtoa (or perhaps ryu) is the way to go here.

Are all the floating point numbers pretty small (meaning not that many digits to the left of the decimal)? If that's the case, you don't really need comma separators and therefore don't really need num-format. Even if num-format could do floats (it can't right now), it would necessarily be slower than dtoa or ryu.

If my assumption is wrong and you would really like comma separators for some reason, I'd be happy to add floating point support to num-format so you could use it here instead, but, again, it can never be as fast as dtoa or ryu. It might even be significantly slower than those two if I can't come up with some kind of algorithm that a) is similar to either dtoa or ryu and b) lets one slot in comma separators in the right places. I haven't had time to dig into that question, and it doesn't appear obviously doable, so there's a possibility that "floating point support" for num-format ends up meaning something really naive, like calling to_string and then inserting commas in the right places, or something equally awful from a speed perspective.

Hmm, actually, yeah, you're right, we specifically don't want comma separators, since these are used as px units in the browser. We only want the decimal point. So num-format is probably out of the question. It sounds like a modified dtoa is what we want then (or maybe we could send a PR to dtoa to make "truncate decimals" an option).

Assuming truncation is all you need (not rounding), something like the write_truncated function below might get you a speed-up without having to fork / modify dtoa. It probably wouldn't be as fast as modifying dtoa, but it might be faster than the current implementation because it leverages dtoa.

use std::io;

use rand::Rng;
use str_stack::StrStack;

fn main() -> Result<(), io::Error> {
    let mut rng = rand::thread_rng();
    let mut stack = StrStack::new();
    let mut buf = Vec::new();
    for _ in 0..100 {
        let f = rng.gen::<f64>() * 1000.0;
        write_truncated(&mut stack, &mut buf, f, 2)?;
    }
    println!("{:?}", &stack);
    Ok(())
}

fn write_truncated(
    stack: &mut StrStack,
    mut buf: &mut Vec<u8>,
    f: f64,
    precision: usize,
) -> Result<(), io::Error> {
    buf.clear();
    dtoa::write(&mut buf, f)?;
    let decimal_position = buf
        .iter()
        .position(|&b| b == b'.')
        .expect("should always contain a '.'");
    let end = decimal_position + precision + 1;
    while buf.len() < end {
        buf.push(b'0');
    }
    let slice = &buf[..end];
    let s = unsafe { std::str::from_utf8_unchecked(slice) };
    stack.push(s);
    Ok(())
}

That's a good question -- I don't actually know what {:.2} does -- round, truncate, or something weirdly in-between. It's also not clear to me that it matters for coordinates. @jasonrhansen?

It looks like {:.2} rounds instead of truncates. I'm not familiar enough with the code base to know if we need rounding or can just get away with truncation.

That said, if we do not need rounding, here's a branch that uses dtoa and truncation that should be (I haven't benchmarked it) faster than the current implementation.

It appears to pass all the tests except those that it fails because rounding is different than truncation.

Perhaps someone who knows more about the benchmark setup you guys are using could see if this does indeed give a speed-up. If so, we can clean it up a little bit and perhaps put it in as a stop-gap on the road to modifying dtoa (assuming, of course, that truncation ends up being OK).

If it's worth pursuing (it may not be), I can create a PR to start collaborating on how to clean up this "proof-of-concept."

Branch: https://github.com/bcmyers/inferno/tree/float-formatting
Diff with master: bcmyers@935fd3b

It probably doesn't really matter if we truncate instead of round. It will slightly diverge in behavior from the Perl version, but if the performance increase is large enough, I'd say it's worth it. Let me do some benchmarking.

When I run the benchmarks with the default options, it doesn't look like there's much of an improvement.

flamegraph/dtrace       time:   [991.36 us 992.91 us 994.62 us]
                        thrpt:  [76.382 MiB/s 76.513 MiB/s 76.633 MiB/s]
                 change:
                        time:   [-1.6386% -1.2175% -0.7921%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7985% +1.2325% +1.6659%]
                        Change within noise threshold.

flamegraph/perf         time:   [3.8639 ms 3.8762 ms 3.8930 ms]
                        thrpt:  [158.36 MiB/s 159.05 MiB/s 159.56 MiB/s]
                 change:
                        time:   [-1.9801% -1.4496% -0.8547%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8621% +1.4709% +2.0201%]
                        Change within noise threshold.

It's a little better when I use the reverse_stack_order option, which produces many more frames.

flamegraph/dtrace       time:   [4.7432 ms 4.7517 ms 4.7625 ms]
                        thrpt:  [15.952 MiB/s 15.988 MiB/s 16.017 MiB/s]
                 change:
                        time:   [-3.1295% -2.7839% -2.4623%] (p = 0.00 < 0.05)
                        thrpt:  [+2.5245% +2.8636% +3.2306%]
                        Performance has improved.
flamegraph/perf         time:   [25.980 ms 26.151 ms 26.329 ms]
                        thrpt:  [23.416 MiB/s 23.575 MiB/s 23.731 MiB/s]
                 change:
                        time:   [-2.7176% -2.2753% -1.8368%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8711% +2.3283% +2.7935%]
                        Performance has improved.

@jasonrhansen that's interesting -- are you swapping out all the cases I highlighted above where we do formatting? What shows up as the primary hot-spots if you flamegraph the benchmark now?

I probably missed a couple of areas where we could speed up the code; I'll do a more thorough pass sometime soon. Hopefully we can get more than a 1-2% speedup from this.

Since hastily putting together the branch above, I started writing a little standalone library with write_truncated and write_rounded functions that use dtoa so I could micro-benchmark them against write!(buf, "{:.2}", f) (i.e. against the standard library). It looks like we can speed up that specific function about 1.5-2x relative to the standard library: preliminary results are that a custom write_truncated takes about 65ns, a custom write_rounded about 80ns, and the standard library equivalent (rounded) about 105ns.

I ran out of time to continue working on this today, but am happy to keep sending you what I find in the coming days.

Here's the flame graph with the changes @bcmyers made.


flamegraph.svg.zip

That's interesting... From what I can tell, here are the current bottlenecks:

  1. Writing to STDOUT: ~16%
  2. quick_xml::write_event: ~13%
  3. StrStack::write: ~12%
  4. Frame merging: ~12% (a decent amount of this is hashing, which may be sped up by fnv in #113)
  5. write_truncated (~7%) + write_truncated2 (~5%) ~= 12%
  6. BytesStart::extend_attributes: ~11% (still some allocation happening in filled_rectangle?)
  7. Formatting integers: ~7%

It certainly seems like there aren't any obvious optimization targets that would give us huge wins here; everything is around 10%. I think the best candidates are likely 4, 5, and 7. Item 6 we should mostly have dealt with at this point, though there are still some allocations in filled_rectangle that I don't think should be showing up.

I tried using FnvHashMap for frame merging, but it didn't seem to make a difference.

Interesting... I wonder if hashbrown + fnv would...

Something is still very weird here... Looking at the performance results in #114, --reverse is still significantly slower than without reverse. We should probably do a flamegraph diff between them to see where the extra time is spent.

This is a differential flame graph comparing with and without --reverse.

diff.svg.zip

That's super weird. I don't understand why it's so much slower given that profile :/

Well, both flamegraph::filled_rectangle and svg::write_str have around 20x the number of samples when using --reverse.

| Function | Samples | Samples (`--reverse`) |
| --- | --- | --- |
| `flamegraph::filled_rectangle` | 242 | 4,528 |
| `svg::write_str` | 282 | 5,741 |

Yeah, but the file is also larger. I guess there are more calls to the expensive drawing functions relative to the number of lines in the file because there is just less frame merging.

I benchmarked with hashbrown + fnv, but saw no change in performance.