mre / hyperjson

🐍 A hyper-fast Python module for reading/writing JSON data using Rust's serde-json.

Find and fix possible performance bottlenecks

mre opened this issue

Yesterday I did some profiling using the setup described here.
The resulting callgrind file is attached. This can be opened with qcachegrind on Mac or kcachegrind on Linux.

callgrind.out.35583.zip

If you don't have either of those programs handy, I've added a screenshot of the two main bottlenecks I can see. I'm not an expert, but it looks like we spend a lot of time allocating, converting, and dropping the BTreeMap, which is converted to a dictionary and returned to Python at the end.

I guess we could save a lot of time by making this part more efficient, e.g. by copying less and working with references instead. I might be mistaken, though. Help and pull requests are very welcome.
😊

hyperjson-bench

Nice to know, thanks for keeping me posted.

I already noticed that the performance wasn't so awesome the last time I tried it on my machine with timeit.

Yeah, it's definitely not. I guess at least partially because of needless copying. Would be nice to have a tool to visualize allocations and memory usage. I looked around but couldn't find any...

I guess I would start by optimizing this BTreeMap now.
The way I generated the above screenshot was by doing the following:

cargo build
valgrind --tool=callgrind --main-stacksize=1000000000 target/debug/hyperjson-bench
callgrind_annotate --auto=yes callgrind.out.35583 >out.rs
qcachegrind callgrind.out.35583

I'm running this on a Mac.
Note that I'm using a debug build here, as suggested in this article, to preserve the symbols in the binary and get "pretty output" in the end.
I also created a stand-alone bench.rs binary for this. I could have used benchmark tests, but that would be one more thing requiring nightly Rust, and it wasn't obvious to me how to call cachegrind on the binary created from a benchmark test. We can change that later; for now, I guess it's okay for quickly testing performance improvements.

Oh yeah, and if you profile, don't forget to do that on a release build with debug flags. :trollface:
Add this to your Cargo.toml:

[profile.release]
debug = true

More info:
https://doc.rust-lang.org/book/second-edition/ch14-01-release-profiles.html

This profiling run is outdated by now. Also, the issue is not really actionable, as profiling and performance improvements will always be an ongoing effort. Therefore I'm closing this to keep the issue tracker clean.
As a note, if you're on Linux and would like to profile, simply run make profile and you'll get a flamegraph. See the instructions in the README.md on how to set up perf for that.