Profile and improve map performance

Question

Profile and improve map performance

domenicquirl opened this issue 4 years ago · comments

We just merged #43, which gives us benchmarks! Currently we have benches against dashmap and hashbrown.

First runs of the benchmarks seem to indicate that retrieving values from the map is quite performant, but inserting into the map is rather slow compared to both other maps. Preliminary results:

against hashbrown, insert and insert_erase run in the order of one ms, compared to 10-20 us in hashbrown. get holds up reasonably well though, still ~20 us to ~5-6, but given the added concurrency this seems much better than insert. Both of these use one guard across all operations of one benchmark run and relatively consistent across input distributions.
against dashmap, dashmap inserts in something between 1 ms with 1 thread and .8 ms for multiple threads, however what's probably more important is throughput. This for some reason is not in the report produced by criterion, but was on the order of 100 MElements/s. We compare with about 25-12 ms depending on thread count, which is so slow it takes several minutes to run any benchmark and puts throughput at something like 5, maybe 10 MElements/s. This is using 1 guard per map operation, which admittedly is not what you would do in a scenario such as the benchmark.
against dashmap, but with only one guard per thread (the different setup), we insert in 14-11 ms, which is better, but still considerably slower than dashmap. Note that this does not compare directly due to the different setup of the threads.
get against dashmap seems really good! dashmap gets 700-500 us, while we get 1.5 ms on one thread, 800us on 2 threads and go down to below 300 us on 8 threads, even with pin() for every call to get. With only one guard, this improves to 800 to a bit more than 200 us, depending on thread count.

It would be interesting to find out if these results are reproducible and, if so, what causes inserts to be so slow. @jonhoo suggests graphing the results of perf record as a profiling method in the discussion in #43, maybe this can be used as a starting point for the analysis.

Jon Gjengset · Answer 1 · Fri Jan 31 2020 07:11:35 GMT+0800 (China Standard Time)

To start people off, here is a flamegraph of insert_flurry_hashbrown/low_guard_once:

Looking at the profile, the stack on the right (ignoring the gnuplot stack) is probably what we should look at (direct link). From what I can tell, table initialization is really the big thing here, which isn't really what we want the benchmark to measure. We should look into how to avoid measuring that when measuring the cost of inserts.

Jon Gjengset · Answer 2 · Fri Jan 31 2020 07:12:47 GMT+0800 (China Standard Time)

When looking at the cost of inserts, the biggest cost I see is allocation (Owned::new), which isn't terribly surprising. This is a cost we have to pay, but many other implementations do not. I wonder if we could save a fair bit by keeping a pool of Owned that can be re-used rather than defer_destroying them immediately...

DQ · Answer 3 · Fri Jan 31 2020 07:35:30 GMT+0800 (China Standard Time)

Interesting. The hashbrown benchmarks create the map once and then clear() it at the start of every iteration. This should definitely be added to our benchmarks once #42 is done. I contemplated using retain(|_,_| false) instead, on my machine this performs somewhat faster (significant difference), but still only on the order of a few hundred us. We should have an eye on improvements from clear() though.

Overall, I'm more concerned with the comparison to dashmap since it is more directly comparable in usage and the difference is so much larger. It is also worth noting that dashmap does create a new map in each iteration in their benchmarks for insert (see task_insert_dashmap_u64_u64()).

DQ · Answer 4 · Fri Jan 31 2020 07:42:22 GMT+0800 (China Standard Time)

Re: Owned what would support your thoughts about re-using them is that Owned::drop is also a sizeable portion of transfer, using slightly more time than malloc according to the graph.

Edit: transfer is also responsible for most of the time used by with_capacity and with_capacity_and_hasher, as it is used to help with initialisation, so these methods would transitively profit from this.

Jon Gjengset · Answer 5 · Fri Jan 31 2020 07:43:50 GMT+0800 (China Standard Time)

Here is the flamegraph for dashmap::insert_u64_guard_once:

Jason Hinch · Answer 6 · Sat Feb 01 2020 05:22:09 GMT+0800 (China Standard Time)

If creating and dropping Owned within the transfer method is a major source of performance issues, one option might be to share the Moved nodes allocated to point to the new table, instead of allocating one for each entry. Each BinEntry::Moved is exactly the same in terms of the data it holds and Atomic which stores the bin entry can take both a Owned or a Shared as the pointer for its input.

Jon Gjengset · Answer 7 · Sat Feb 01 2020 05:29:49 GMT+0800 (China Standard Time)

@jhinch that could work, though the tricky part is going to be to find the right way to free them. If we can figure that out though, that would probably help a lot.

DQ · Answer 8 · Sat Feb 01 2020 06:11:24 GMT+0800 (China Standard Time)

I'd say the Moved would be associated with the table where it is used and could be marked as garbage together with the containing table? So either with Table::drop or explicitly when a resize finishes. Terminating resizes is the point where the old table is no longer accessible, so no new threads can get a reference to the Moved and it could be defer_destroyed and collected after the references to the old table are gone? This would be similar to the argument about the safety of defer_destroying the old table here. I'll think about this some more when I'm less tired, but also I don't have that much free time at the moment, so we'll see.

Jon Gjengset · Answer 9 · Sat Feb 01 2020 06:29:27 GMT+0800 (China Standard Time)

That's a good observation, and yes, I think that could work! That should give a very nice speedup for resizes indeed. Anyone want to write up a PR?

Jon Gjengset · Answer 10 · Sat Feb 01 2020 22:32:44 GMT+0800 (China Standard Time)

Thinking some more about this, I think the trick will be to allocate a Moved for a table when its replacement is allocated, and then to store a pointer to that in the Table struct of the table that is being moved from. And then in Table::drop we mark it for deletion. We have to be a little careful though that we don't have any outstanding references to it, even inside drop, when we destroy it.

DQ · Answer 11 · Sun Feb 02 2020 18:28:42 GMT+0800 (China Standard Time)

As an update on hashbrown, I changed the benchmark to use clear now that we have it (see #55), which gets insert down to about 260us on my machine. The other benchmarks are untouched since they do not use clear, though interestingly get is now down to ~7-8us for me as well, matching hashbrown more closely. For now, this seems fine, except that I don't know where the rather big difference for insert_erase originates.

DQ · Answer 12 · Sun Feb 02 2020 18:34:35 GMT+0800 (China Standard Time)

@jonhoo That's what I meant, yeah. Sorry if that was unclear from my previous, rather tired comment 😅

Could you expand on the problem you see with existing references at the point of droping the old table? The safety guarantee we need to provide for defer_destroy is only that no new references to the Moved can be obtained by other threads, which should be the case if drop for the old table is called due to a resize finishing as then the map will refer new threads to the new table. Existing references would be protected by the guard that obtained them, since as long as they are alive the old table cannot be dropped, no?

Jon Gjengset · Answer 13 · Mon Feb 03 2020 00:19:53 GMT+0800 (China Standard Time)

@domenicquirl Ah, but notice that the drop code uses epoch::unprotected, which destroys immediately.

DQ · Answer 14 · Mon Feb 03 2020 00:21:50 GMT+0800 (China Standard Time)

I found some time to try and implement this in #56. The result on my machine is a very significant performance improvement:

insert is down to 18ms on one thread and then 9ms on 2 down to 6.5 on 8 threads. Holding one guard for all inserts improves this to 9-6ms
get gets down to 200us with multiple guards and below 100 us with one guard

DQ · Answer 15 · Mon Feb 03 2020 00:25:11 GMT+0800 (China Standard Time)

@domenicquirl Ah, but notice that the drop code uses epoch::unprotected, which destroys immediately.

When working on it I noticed you were also probably referring to other references from bins in the same table, which I did not consider immediately. Still, destroying it immediately is also what I do at the moment, under the assumption that at the point of Table::drop all references to the table were in a previous epoch and are now released 🤔

Jon Gjengset · Answer 16 · Mon Feb 03 2020 00:30:20 GMT+0800 (China Standard Time)

Ah, no, what I was thinking about was making sure that all the pointers that point to the shared BinEntry::Moved have been dropped before you drop it. The approach you've taken in #56 does this, so I think you're all good!

You can assume that all references to the table were indeed in a previous epoch. This was the core of the argument in a9c6890, and is why all returned references are tied to both the lifetime of &self and &Guard.

Jon Gjengset · Answer 17 · Wed Feb 05 2020 07:35:12 GMT+0800 (China Standard Time)

Updated flamegraphs:
flurry_hashbrown insert_flurry_hashbrown/low_guard_once

flurry_dashmap insert_flurry_u64_u64_guard_once/1

Jon Gjengset · Answer 18 · Wed Feb 05 2020 07:36:56 GMT+0800 (China Standard Time)

I wonder how different these numbers would look with jemalloc... We're certainly being bitten by the cost of allocation here, which dashmap and hashbrown do not have to pay. I'd be interesting to do a scale-up experiment with more cores to see how our contention story compares! That is arguably more important than the absolute numbers for any single operation.

Jon Gjengset · Answer 19 · Wed Feb 05 2020 07:38:36 GMT+0800 (China Standard Time)

It is particularly bad that we have to do two allocations for every put -- one for the value and one for the node. We could at least to a little better by not allocating for the node if we end up doing a replacement. Keeping a "graveyard" for allocated Owned might also save us a bunch.

Jon Gjengset · Answer 20 · Wed Feb 05 2020 23:41:44 GMT+0800 (China Standard Time)

Thinking some more about this, I don't think a graveyard will help us much. Unless the user does a lot of removals or replacements (which I don't think are that common — updates maybe?), there probably won't be much garbage for us to re-use, and it won't be worth the bookkeeping overhead.

Jon Gjengset · Answer 21 · Thu Feb 27 2020 02:39:50 GMT+0800 (China Standard Time)

I wrote a first draft of a more "serious" concurrent benchmark: https://github.com/jonhoo/bustle

Jon Gjengset · Answer 22 · Thu Feb 27 2020 05:06:32 GMT+0800 (China Standard Time)

Flamegraph from running bustle's ready_heavy benchmark against flurry:

DQ · Answer 23 · Thu Feb 27 2020 05:54:07 GMT+0800 (China Standard Time)

That's a lot of time taken by guards, even though it appears you are running this against HashMapRef, so this is one guard for all get() calls in one iteration?

Jon Gjengset · Answer 24 · Thu Feb 27 2020 06:49:40 GMT+0800 (China Standard Time)

Yup, that's one guard pin for each iteration. I have a version that just repins every ~1k operations, and it does much better.

Jon Gjengset · Answer 25 · Thu Feb 27 2020 07:03:01 GMT+0800 (China Standard Time)

Here is 10 concurrent threads running bustle each repinning every 1024 operations:

DQ · Answer 26 · Thu Feb 27 2020 07:18:46 GMT+0800 (China Standard Time)

Interesting. For how much we've looked at Owned::new, put spends way more time with garbage collection in this. Does this update existing keys in insert?

Also how many elements are in the map at max? I'm wondering about get_node consisting mostly of loading a bin and spending little time in find, as well as the lack of transfer.

For a safety check, check_guard seems to take a lot of time as well.

Jon Gjengset · Answer 27 · Thu Feb 27 2020 07:31:40 GMT+0800 (China Standard Time)

This workload is very read-heavy. The mix includes both inserts (2%), updates (3%), and deletes (1%).

In this particular benchmark the table had an initial capacity of 2^29 with 75% of the keys populated before the benchmark began. Then the mix above ran 402653184 operations across 10 threads in 13.201507468s (time/op = 32ns).

I agree the lack of transfer seems odd here, not sure what that's about. I'm not terribly surprised that get_node dominates. I agree that the equality check on the guard is pretty unfortunate, but not sure how to fix it...

DQ · Answer 28 · Thu Feb 27 2020 08:11:02 GMT+0800 (China Standard Time)

If the table is presized to be this large and actually allocates the bins beforehand, then it never needs to transfer as 2^29 > 402653184. So that explains that.

Regarding get_node it's not that the method itself is being called a lot. But isn't is weird that inside it we spend almost no time finding nodes and almost all of the time loading bins? This is Table::bin, so it's only the load of the first BinEntry for some hash value. And if we never transfer, it's also never an additional look-up in a next table via Moved.

I don't have a good idea for the guard check in HashMap in general. But at least if we are accessing the map through a HashMapRef created by HashMap::pin() the guard should be safe, since it is provided by the map itself, right? If we can somehow distinguish between such HashMapRef and ones with user-provided guards, we could have unchecked implementations for public methods as pub(crate) fn unchecked_XYZ and make pub methods wrappers around these that call check_guard first and then the unchecked method. Then pin() created HashMapRef could use the unchecked methods directly since we know the guard is safe. It feels bad to do the check even if the user doesn't even supply a guard...

Jon Gjengset · Answer 29 · Thu Feb 27 2020 12:09:27 GMT+0800 (China Standard Time)

Ah, no the table is initialized with a capacity of 2^29, but then it is pre-filled with 75% of the capacity. And then it does all those operations, 2% of which are inserts. But yes, probably still explains no transfers.

I think that just suggests that most lookups find bins of length 1, where the first element matches. That would mean that we very rarely end up walking a bin through find. Which is arguably the intended operation — if bins were long, we should resize.

Hmm, yeah, I'm not sure. Having a private "I promise this Guard is fine" API that is called from the public APIs that contain the check seems pretty reasonable, though that then actually requires storing a HashMapRef over time to work. Even keeping the same Guard and then using HashMapRef::with_guard when you want a ref won't work, since it'd have to check. One option would be to define our own Guard type which wraps an epoch::Guard and a boolean. The first time we check a given Guard, we set the bool (which is private, so the user can't mess with it), and on subsequent uses of the same Guard, we just check the bool. Which should be a lot faster. It would mean that users couldn't use crossbeam-epoch::pin directly on our methods, but that might even be an advantage. We probably want them to default to using HashMap::guard to get a Guard anyway.

DQ · Answer 30 · Thu Feb 27 2020 18:39:08 GMT+0800 (China Standard Time)

Yeah, but still if a capacity of 2^29 is requested the first table that is created will immediately have this capacity. Maybe we need one resize if 75% is also our load factor, but I wouldn't expect that to show up.

Long bins are also interesting for hash collisions, because then we don't have a resize. But hopefully this picture is similar, at least once we have tree bins.

Is holding a HashMapRef a problem? I thought the whole idea was to re-use the guard, in which case both holding the ref and holding the guard pins the epoch.

If we wrap Guard, would we have to check at all? If the wrapper is only constructed by us (the HashMap), wouldn't we know it contains a valid guard since we put it in? I also don't know if I agree with you that we don't want to allow access with other guards. If a user has multiple things to access with guards (maybe just multiple instances of HashMap), they would have to register and de-register one guard per instance with epoch garbage collection.

I think we should either

only allow accessing the map with our own guard, in which case there are no checks, or
allow both our and other guards, if they match. For this I would not check the guard if we provide it, and check a user's guard at the time of creating the wrapper -- this way, the wrapper still only needs to contain the guard. Maybe the wrapper could even Deref to the guard, but that's just a thought.

I personally am in favour of the second option. What do you think?

Jon Gjengset · Answer 31 · Thu Feb 27 2020 22:20:08 GMT+0800 (China Standard Time)

Ah, but the HashMapRef has a lifetime, whereas the Guard does not. So it's generally going to be easier to hold the Guard.

Yes, you will still need to check when the Guard is first used, as otherwise a user could use the Guard from one map with the methods from another map. That's fine for now when they're all the global collector, but wouldn't be if we ever moved to a non-global collector.

One option is to have a method like HashMap::checked_guard which takes an epoch::Guard and returns a checked flurry::Guard assuming the given guard matches the inner collector. That would give users a way to re-use their own guards with our thing while avoiding the repeated checking cost.

I'm not entirely sure how you are suggesting the second option would work? Some kind of trait that we implement for both "our" guard and the epoch guard, with a method that returns whether it's been checked? That's probably going to cause all sorts of soundness issues, like the user implementing the trait themselves for their own type. I prefer requiring that the user must use our Guard type, and that we provide a convenient way to do so for an epoch::Guard.

One thing we have to be very careful with is that a user doesn't take a "checked" Guard from one map and re-uses it with another map. Not sure how we do that check though. A bool won't be enough.

DQ · Answer 32 · Thu Feb 27 2020 23:06:04 GMT+0800 (China Standard Time)

No no no, I did not want to implement this as a trait 😅
I think we were talking past each other a bit. What I meant with 2) is what you call check_guard, i.e. having our own guard type which checks the user's guard upon creation. I just wanted to say that in contrast, if the user calls HashMap::pin, we create the guard ourselves and don't have to check against the collector again.

I was also assuming that the collector will stay the same, since we currently explicitly don't provide a with_collector method. If the intention is to avoid checks for guards created by a map, it will be difficult to distinguish these guards. If we want that safety, this will probably always involve doing a check for every method call, the question is what to check. Would have to be either an identifier for the collector or the source map for which Eq can be checked faster than for collectors.

DQ · Answer 33 · Mon Mar 23 2020 19:14:44 GMT+0800 (China Standard Time)

Getting the cost of check_guard down has been on my mind recently. I'm continuing from ideas mentioned before: We would want to get away with a less costly check with each method call and could provide our own Guards to do so, but probably still need to distinguish different instances of HashMap, so our Guards are not cleared by one map and then used in another.

A naive idea would be to count the maps and give each an increasing ID, and then store in the Guard for which IDs it was already checked. This has the obvious issue of a (possibly large but) finite number of maps that can be created. But what we could do is have a pool of N ids, which can be assigned to a map and re-used if that map is dropped. A map with an ID could use the cheaper check against the ID after checking any Guard once, and maps without an ID can default to the expensive check until they are assigned an ID (or forever in the worst case). A map without an ID could try to obtain one whenever (or every n times) it has to perform the expensive check.

Remaining considerations with this:

The whole ID thing requires some amount of global state, which is not that nice and needs to be synchronized across threads. If maps only check for an available ID upon creation, this is probably not much of a problem, but if we want to have re-checks, we would have to make sure accessing this global state doesn't bottleneck everything else
We still would have to make sure that Guards are not considered "cleared" for a map that re-used an earlier index. An idea could be that a Guard cannot outlive any map he is cleared for. Maybe the guards could store references to the maps directly instead of the IDs, so the maps can't get dropped while the Guard still exists. This would have to be very explicit to the user however. Ultimately, I'm not sure what a good way to do this would be.

I hope this at least furthers the discussion and maybe produces some more ideas 😉

Jon Gjengset · Answer 34 · Mon Mar 23 2020 20:50:06 GMT+0800 (China Standard Time)

So, here's where this all gets weird to me — checking that a Guard is for a given Collector is just a pointer equality check, so I don't know that replacing it with a number would speed things up by much. They should both just be doing a straight comparison of a usize. There isn't even any pointer dereferencing there...

DQ · Answer 35 · Mon Mar 23 2020 21:10:55 GMT+0800 (China Standard Time)

That is weird indeed. Why then would it take up so much time? But yeah, if we not only change it to our own check, but have to introduce additional branching and bookkeeping overhead, it's likely that would not be faster. It just seems too much for a single comparison, given how involved some of the map methods are 🤔

Jon Gjengset · Answer 36 · Mon Mar 23 2020 21:37:18 GMT+0800 (China Standard Time)

It could be that it just happens so often. Maybe accessing the &Collector is what's expensive, I'm not sure? Some more benchmarking may be needed here.