matchSet.addX: hotspot
yosiat opened this issue · comments
Hi,
I am doing some performance tests to see if I can use this library instead of something I built.
Quamina is faster on most of my benchmarks, but there are two main spots which cause terrible degradation:
- Flattening - I am passing large objects (~7 KB in size, with nesting) and flattening kills the performance, but since I know the list of fields (and paths) being searched on, I can probably reduce the incoming event to a smaller one.
- matchSet.addX - currently it duplicates the matchSet (even if there are no new matches), which causes a major degradation.
I changed addX locally to update in place instead of creating a new matchSet, and this is the perf difference I see:
Before:
Benchmark_Quamina-10 27054 49177 ns/op 43311 B/op 160 allocs/op
After:
Benchmark_Quamina-10 221080 6717 ns/op 8932 B/op 28 allocs/op
From what I see, it's done because of concurrency concerns, but looking at -
Line 151 in 4477484
And maybe a simple RWMutex can help us instead of duplication?
I can submit a PR to fix this, but I'm trying to understand what options we have in hand.
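For readers following along, here is a minimal sketch of the RWMutex alternative raised above. The lockedSet type and its methods are hypothetical and are not part of Quamina; the sketch only shows the general trade-off, where writes update the map in place and every read pays for a read lock.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedSet is a hypothetical set guarded by a sync.RWMutex: many readers
// may hold the read lock at once, while a writer takes the exclusive lock.
type lockedSet struct {
	mu    sync.RWMutex
	items map[string]bool
}

func newLockedSet() *lockedSet {
	return &lockedSet{items: make(map[string]bool)}
}

// add updates the map in place under the write lock; nothing is copied.
func (s *lockedSet) add(keys ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, k := range keys {
		s.items[k] = true
	}
}

// contains takes only the read lock, so concurrent lookups do not block
// each other.
func (s *lockedSet) contains(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.items[key]
}

func main() {
	s := newLockedSet()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			s.add(fmt.Sprintf("x%d", id))
			_ = s.contains("x0")
		}(i)
	}
	wg.Wait()
	fmt.Println(s.contains("x3"))
}
```

Whether the lock overhead on the read path is cheaper than copying depends on the read/write mix, so this would need measuring against the real workload.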
First of all, on the flattening, I suspect it's going to be hard to improve things very much. The code in flatten_json that skips over unwanted fields is pretty fast so it may not be cost-effective to write code to remove fields from events before testing them.
I think the best approach to speeding up flattening would be to use protobufs or Avro or something where there are pointers to the fields so you don't have to scan over all the bytes that aren't interesting. Several people have told me that they think in principle it should be straightforward to write such a hyper-efficient flattener, but AFAIK nobody has done it yet.
Oh, and I should say, thanks for the input!
On the concurrency stuff… that's surprising. Could I ask a favor? Go into concurrency_test.go and try increasing the intensity by increasing the "n" and "tasks" variables, with your update-in-place change?
The concurrency we're worried about is between AddPattern() and MatchesForEvent(). There can only be one AddPattern thread but there can be many MatchesForEvent. First symptom was one of the MatchesForEvent goroutines reading a map[] while an AddPattern goroutine was updating it; the runtime throws a specific panic. Second symptom: Same, only for a slice, except this eventually causes an invalid-index panic. The second symptom was much rarer and harder to reproduce.
The update-by-replace in matchSet and other places fixed the first problem, and then the atomic.Value in coreMatcher, fieldMatcher, and valueMatcher fixed the second. Possibly the atomic.Value also removed the need for the fancy matchSet? That doesn't seem true in my mind, but I could easily be wrong.
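A minimal sketch of the update-by-replace idea combined with atomic.Value, assuming a single writer and many readers as described above. The table type is a hypothetical stand-in and is not Quamina's coreMatcher, fieldMatcher, or valueMatcher.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// table is a hypothetical stand-in for matcher state read by many goroutines.
type table struct {
	// current always holds a map[string]int that is never mutated after Store.
	current atomic.Value
}

func newTable() *table {
	t := &table{}
	t.current.Store(map[string]int{})
	return t
}

// add must only be called from one writer goroutine. It copies the current
// map, adds the entry to the copy, and publishes the copy atomically.
func (t *table) add(key string, val int) {
	old := t.current.Load().(map[string]int)
	fresh := make(map[string]int, len(old)+1)
	for k, v := range old {
		fresh[k] = v
	}
	fresh[key] = val
	t.current.Store(fresh)
}

// get may be called from any number of goroutines; it reads a snapshot that
// no writer will ever modify.
func (t *table) get(key string) (int, bool) {
	m := t.current.Load().(map[string]int)
	v, ok := m[key]
	return v, ok
}

func main() {
	t := newTable()
	var wg sync.WaitGroup
	for r := 0; r < 4; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				t.get("k500")
			}
		}()
	}
	for i := 0; i < 1000; i++ { // the single writer
		t.add(fmt.Sprintf("k%d", i), i)
	}
	wg.Wait()
	v, ok := t.get("k999")
	fmt.Println(v, ok)
}
```

The cost of this pattern is the copy on every write, which is the same kind of duplication the issue reports for matchSet.addX.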
I wonder if the case where matchSet.addX() is adding no new values is easy to detect and simply return the input? I'll have a look at that.
BUT now it dawns on me that the matchSet that is used only in the AddPattern thread, which is the variable matches in matchesForSortedFields(), is accessed only in that thread, and I'm sure it could safely use update-in-place. If you wanted to create a new type for that purpose, I'm pretty sure I'd take that PR as a first step.
Hi,
Thanks for the responses! I'll look at them in detail and answer tomorrow (or the day after).
I wonder if the case where matchSet.addX() is adding no new values is easy to detect and simply return the input? I'll have a look at that.
From my data, before:
Quamina-10 26460 47247 ns/op 43318 B/op 160 allocs/op
After adding a len check on exes in addX:
Quamina-10 35757 33191 ns/op 30393 B/op 129 allocs/op
It gives a good improvement, but it's not the one I need in order to replace my existing implementation.
If you wanted to create a new type for that purpose, I'm pretty sure I'd take that PR as a first step.
You are suggesting a new type which is the same as matchSet but does update-in-place? It sounds OK, but what do you think about one method vs. two methods?
// one method option
addX(updateInPlace bool, exes ...X) *matchSet
// two methods option
addX(exes ...X) *matchSet
unsafeAddX(exes ...X) *matchSet // does update-in-place
First of all, on the flattening, I suspect it's going to be hard to improve things very much. The code in flatten_json that skips over unwanted fields is pretty fast so it may not be cost-effective to write code to remove fields from events before testing them.
About flattening, I admit I haven't looked much at the code, what it returns and how it works, but in my solution I am getting those large events and managing to work with them really efficiently.
I achieve that by using jsoniter / jx, which allow me to read a subset of the event, and instead of flattening I am doing lazy reads against that JSON.
What helped me a lot here was restructuring my object, so instead of:
{
"field1": "a", // have match against
"field2": "b", // have match against
... rest of the 7kb data ..
}
I am doing:
{
"context": { "field1": "a", "field2": "b" },
"payload": { ... 7kb data .. }
}
With this approach jsoniter/jx don't need to traverse all of the properties, just get to the context property and that's it.
I'll give flattening a deeper look and see why this restructuring doesn't help Quamina, since essentially the flattener should skip the whole "payload" object.
Again, thanks for your feedback and for this awesome library!
Haven't seen your code but I assume you mean something like
if len(exes) == 0 {
	return m
}
then yes, please include that in any PR.
Yes, two methods on the existing matchSet type is probably better. Maybe addXSingleThreaded() or some such?
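A minimal sketch of the two-method option discussed here, including the early return when there is nothing to add. The matchSet below is a simplified, hypothetical stand-in (a map-backed set keyed by X, with X assumed to be an arbitrary comparable identifier); Quamina's real type may differ.

```go
package main

import "fmt"

// X is a placeholder for the caller-supplied match identifier (an assumption
// for this sketch).
type X any

// matchSet is a simplified, hypothetical stand-in for the type in the issue.
type matchSet struct {
	set map[X]bool
}

func newMatchSet() *matchSet {
	return &matchSet{set: make(map[X]bool)}
}

// addX returns a new matchSet holding the old members plus exes and leaves
// the receiver untouched. With nothing to add it returns the receiver as-is,
// avoiding the copy.
func (m *matchSet) addX(exes ...X) *matchSet {
	if len(exes) == 0 {
		return m
	}
	fresh := &matchSet{set: make(map[X]bool, len(m.set)+len(exes))}
	for k := range m.set {
		fresh.set[k] = true
	}
	for _, x := range exes {
		fresh.set[x] = true
	}
	return fresh
}

// addXSingleThreaded mutates the receiver in place. It is only safe when a
// single goroutine owns the matchSet.
func (m *matchSet) addXSingleThreaded(exes ...X) *matchSet {
	for _, x := range exes {
		m.set[x] = true
	}
	return m
}

func main() {
	base := newMatchSet().addX("p1")
	copied := base.addX("p2")
	fmt.Println(len(base.set), len(copied.set))

	same := base.addXSingleThreaded("p3")
	fmt.Println(same == base, len(base.set))
}
```

Keeping both methods on one type leaves the copying behaviour as the default and makes the unsafe variant an explicit choice at each call site.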
Don't know jsoniter or jx but I'm super-interested in anything that could make flattening faster because last time I profiled it was still >50% of matching latency. So I look forward to hearing what you discover.
Submitted a PR for the matchSet changes - #109. By the way, I see there is no formatting enforced with gofmt - would you accept a PR for running gofmt on the files and enforcing that in CI?
Regarding the flattener, I have looked into it, and given this pattern:
{
"context": {
"field1": ["a"],
"field2": ["b"]
}
}
Event:
{
"context": { "field1": "a", "field2": "b" },
"payload": { ... 7kb data .. }
}
It looks like the flattener goes into "payload" and looks for properties called "context" / "field1" / "field2", which makes it not very efficient for my case (and maybe others as well). Instead, what I suggest is to:
- Keep track of the "paths" for patterns - in my case it's context.field1 / context.field2
- In the flattening phase, pluck only those paths and ignore all others.
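The two steps above can be sketched with the standard library alone. This is an illustration under stated assumptions and not the jx-based flattener from this thread: pathNode, buildPaths, and pluck are hypothetical names, only nested objects are handled (no arrays), and encoding/json still scans the bytes of skipped members to find where they end; it just avoids decoding or descending into them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pathNode is a hypothetical tree of the field paths that patterns refer to.
type pathNode struct {
	children map[string]*pathNode
}

// buildPaths turns dotted paths such as "context.field1" into a tree.
func buildPaths(paths ...string) *pathNode {
	root := &pathNode{children: map[string]*pathNode{}}
	for _, p := range paths {
		n := root
		for _, seg := range strings.Split(p, ".") {
			child, ok := n.children[seg]
			if !ok {
				child = &pathNode{children: map[string]*pathNode{}}
				n.children[seg] = child
			}
			n = child
		}
	}
	return root
}

// pluck records only the values whose path is in the tree. Members that no
// path mentions stay as raw bytes and are never decoded or descended into.
func pluck(data []byte, n *pathNode, prefix string, out map[string]string) error {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(data, &obj); err != nil {
		return err
	}
	for name, child := range n.children {
		raw, ok := obj[name]
		if !ok {
			continue
		}
		path := prefix + name
		if len(child.children) == 0 {
			out[path] = string(raw) // leaf: keep the raw JSON value
			continue
		}
		if err := pluck(raw, child, path+".", out); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	event := []byte(`{"context":{"field1":"a","field2":"b"},"payload":{"big":[1,2,3]}}`)
	out := map[string]string{}
	if err := pluck(event, buildPaths("context.field1", "context.field2"), "", out); err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the walk is driven by the path tree and not by the event, a "context" key nested inside "payload" is never visited.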
Good catch on gofmt. I have my IDE wired up for gofmt-on-save and my bad for not noticing the miss. Yes, I'd love a CI PR for that. But I thought we already had something - @embano1 am I wrong on that?
Yeah, at the current time the flattener traverses the event end-to-end to be sure it hasn't missed any relevant fields. Which gives me an idea: enhance the NameTracker interface to give a list of all the fields in play and then the Flattener could know when it's found everything and stop looking. I'd be a little worried that for short simple events this could actually slow things down, so it might make sense to have a heuristic and only do this check for events larger than a certain threshold.
Good catch on gofmt. I have my IDE wired up for gofmt-on-save and my bad for not noticing the miss. Yes, I'd love a CI PR for that. But I thought we already had something - @embano1 am I wrong on that?
It's because the linter wasn't enabled in golangci, I have enabled it here - #110 and fixed all linting errors.
enhance the NameTracker interface to give a list of all the fields in play and then the Flattener could know when it's found everything and stop looking.
This is exactly what I have in mind. How do you think we should implement such an API? It means we need to store the list of pattern fields in a coreMatcher and pass them via the NameTracker interface. The problem that comes up is how to handle deletions, because then we need some kind of ref-count, which makes things complex.
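A minimal sketch of the ref-count idea mentioned above, assuming each pattern can report the list of paths it uses. The pathRefs type is hypothetical; nothing like it is confirmed to exist in Quamina.

```go
package main

import "fmt"

// pathRefs is a hypothetical reference counter: for each path it records how
// many live patterns refer to it.
type pathRefs struct {
	counts map[string]int
}

func newPathRefs() *pathRefs {
	return &pathRefs{counts: map[string]int{}}
}

// addPattern bumps the count for every path the new pattern uses.
func (r *pathRefs) addPattern(paths ...string) {
	for _, p := range paths {
		r.counts[p]++
	}
}

// removePattern decrements the counts and drops a path once no pattern
// uses it any more.
func (r *pathRefs) removePattern(paths ...string) {
	for _, p := range paths {
		if r.counts[p] <= 1 {
			delete(r.counts, p)
			continue
		}
		r.counts[p]--
	}
}

// inUse reports whether the flattener still needs to pluck this path.
func (r *pathRefs) inUse(path string) bool {
	return r.counts[path] > 0
}

func main() {
	refs := newPathRefs()
	refs.addPattern("context.field1", "context.field2") // pattern A
	refs.addPattern("context.field1")                   // pattern B

	refs.removePattern("context.field1", "context.field2") // delete pattern A
	fmt.Println(refs.inUse("context.field1"), refs.inUse("context.field2"))
}
```

The bookkeeping itself is small; the complexity is more likely to come from keeping it in step with however pattern deletion is actually implemented.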
I'll try to implement a POC of an external flattener which accepts a list of paths and does the flattening using jx / jsoniter. I'll check small objects (where patterns cover 90% of properties) and large objects, and will report back with a code sample and my findings.
A quick update:
I wrote another flattener based on Jx, which accepts "paths" (a tree of the paths that need to be plucked) -
goos: darwin
goarch: arm64
pkg: github.com/yosiat/quamina-flatenner
Benchmark_Quamina_JxFlattener-10 143412 7351 ns/op 24 B/op 3 allocs/op
Benchmark_Quamina_Flattener-10 83698 14388 ns/op 360 B/op 23 allocs/op
Benchmark_LargePayload_QuaminaFlattner-10 77971 15326 ns/op 1240 B/op 44 allocs/op
Benchmark_LargePayload_JxFlattner-10 145306 8318 ns/op 904 B/op 24 allocs/op
PASS
ok github.com/yosiat/quamina-flatenner 5.347s
Tomorrow I'll run the Quamina flattener tests against it to make sure it's compliant, and then I'll push the code to a GitHub repo so you can have a look at it.
Update
It can be improved further -
goos: darwin
goarch: arm64
pkg: github.com/yosiat/quamina-flatenner
Benchmark_Quamina_JxFlattener-10 2612365 437.9 ns/op 24 B/op 3 allocs/op
Benchmark_Quamina_Flattener-10 83246 14347 ns/op 360 B/op 23 allocs/op
Benchmark_LargePayload_QuaminaFlattner-10 78272 15307 ns/op 1240 B/op 44 allocs/op
Benchmark_LargePayload_JxFlattner-10 876013 1365 ns/op 904 B/op 24 allocs/op
PASS
ok github.com/yosiat/quamina-flatenner 5.727s
What is the first column in the benchmark output? Ideally we'd like to run some of the existing Citylots benchmarks replacing json_flattener with yours and compare the performance.
Just one caveat: I am very very reluctant to add dependencies to Quamina, because I see this as a very low-level and horizontal library. I have spent years fighting through dependency-management hell and seen enough horrible security disasters that I am personally reluctant to adopt libraries that have much in the way of dependencies.
What is the first column in the benchmark output?
It's the standard output of go benchmarks, it's the number of loops.
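For reference, a minimal sketch of where those columns come from. The benchmark name in the output carries a -N suffix for GOMAXPROCS, followed by the iteration count, ns/op, and, when allocation reporting is on, B/op and allocs/op. The joinFields function is a made-up function under test, and testing.Benchmark is used here only so the sketch can run outside go test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinFields is a hypothetical function under test.
func joinFields(fields []string) string {
	return strings.Join(fields, ".")
}

// benchJoin has the same shape as an ordinary BenchmarkXxx function: the
// framework picks b.N, and the loop body runs that many times.
func benchJoin(b *testing.B) {
	b.ReportAllocs() // adds the B/op and allocs/op columns
	fields := []string{"context", "field1"}
	for i := 0; i < b.N; i++ {
		_ = joinFields(fields)
	}
}

func main() {
	r := testing.Benchmark(benchJoin)
	fmt.Println("iterations:", r.N) // the first numeric column
	fmt.Println("ns/op:", r.NsPerOp())
	fmt.Println("B/op:", r.AllocedBytesPerOp())
	fmt.Println("allocs/op:", r.AllocsPerOp())
}
```

In a _test.go file the same function would be named BenchmarkJoin and run with go test -bench=. -benchmem.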
Ideally we'd like to run some of the existing Citylots benchmarks replacing json_flattener with yours and compare the performance.
Where do they exist? I can run them and publish the results here until I publish the full code.
Just one caveat: I am very very reluctant to add dependencies to Quamina, because I see this as a very low-level and horizontal library.
Totally makes sense. I used Jx since it was the faster and easier way to get to a result that shows my point. I assume the changes I made can be done with encoding/json, but it was too complex for me to change the existing code to get to my result.
Once I publish my code, I believe you will easily understand what I did, how I improved the performance, and what you can do to adjust the existing code for it.
Have a look at benchmarks_test.go. Unfortunately I previously didn't know about the built-in Go benchmarking support so they don't yet take advantage of that.
Cool, I looked at it and I'll need to do some adjustments to my code to make it work.
Going to sleep, will do it tomorrow ~
This is exactly what I have in mind. How do you think we should implement such an API? It means we need to store the list of pattern fields in a coreMatcher and pass them via the NameTracker interface. The problem that comes up is how to handle deletions, because then we need some kind of ref-count, which makes things complex.
The first/easiest thing I thought of would be to have an API in NameTracker like
func (nt *NameTracker) GetFieldSet() map[string]bool
The idea is it would return a "set" of all the field names that are used and, for convenience, the set would be writeable, so whenever you encounter one in the element you remove it from the set and as soon as the set is empty you stop parsing the event. It would be a bit expensive to generate but you're only going to be using this API on big events, so maybe OK? But I didn't think a lot, quite likely you have a better idea.
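A minimal sketch of how a flattener might consume such a writeable set, assuming it only looks at top-level members of the event. flattenUntilDone and its return values are hypothetical, and the standard encoding/json decoder stands in for Quamina's own flattener.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// flattenUntilDone reads top-level members of a JSON object and stops as soon
// as every name in wanted has been seen. wanted is consumed: names are deleted
// as they are found. It returns the found values and how many members it read.
func flattenUntilDone(event string, wanted map[string]bool) (map[string]string, int, error) {
	dec := json.NewDecoder(strings.NewReader(event))
	tok, err := dec.Token()
	if err != nil {
		return nil, 0, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '{' {
		return nil, 0, errors.New("event is not a JSON object")
	}
	found := map[string]string{}
	visited := 0
	for len(wanted) > 0 && dec.More() {
		keyTok, err := dec.Token() // member name
		if err != nil {
			return nil, visited, err
		}
		key, _ := keyTok.(string)
		var raw json.RawMessage // member value, kept undecoded
		if err := dec.Decode(&raw); err != nil {
			return nil, visited, err
		}
		visited++
		if wanted[key] {
			found[key] = string(raw)
			delete(wanted, key)
		}
	}
	return found, visited, nil
}

func main() {
	event := `{"field1":"a","field2":"b","payload":{"big":[1,2,3]}}`
	wanted := map[string]bool{"field1": true, "field2": true}
	found, visited, err := flattenUntilDone(event, wanted)
	if err != nil {
		panic(err)
	}
	fmt.Println(found, visited)
}
```

The early exit only pays off when the wanted fields appear before the bulk of the event, which matches the concern about short events raised earlier in the thread.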
BTW, I think your technique of moving the interesting fields up to the front of the event is very clever, but probably not a reasonable thing to ask users to do in the general case. But this new API might be a good idea anyhow.
Hi,
I ran benchmarks_test.go and saw only one benchmark, for BigShellStyle -
# Quamina
Field matchers: 27 (avg size 1.000, max 1), Value matchers: 1, SmallTables 54 (avg size 15.500, max 28), singletons 0
428,547.72 matches/second with letter patterns
# Jx
Field matchers: 27 (avg size 1.000, max 1), Value matchers: 1, SmallTables 54 (avg size 15.500, max 28), singletons 0
1,367,947.02 matches/second with letter patterns
I have published the code up here - https://github.com/yosiat/quamina-flatenner
I made some hacks in Quamina to expose some internal methods in order to run the benchmarks externally, so it will be a bit hacky to run it locally, but what's needed is:
- Clone the quamina-flatenner project
- Install https://github.com/rogpeppe/gohack
- Run go build and then gohack get -vcs -f github.com/timbray/quamina
- In the gohack folder, check out this branch - https://github.com/yosiat/quamina/tree/flat-hack, flat-hack.
- And then simply go test -v
I'll make this flow easier later today; instead of an external repo I'll put it inside a fork branch.
Some notes:
- In my "flat-hack" branch, I added a used-paths map, same as namesUsed, in order to get the list of paths passed to the flattener.
- I haven't checked the full correctness of my flattener - benchmarks_test.go and my little tests are passing, but hopefully soon, in a fork branch, I'll be able to get the existing tests running (and then passing ;) )
- I can't currently share the benchmarks I posted above because they contain internal data; I'll anonymise it and post the benchmarks as well.
A bit about the source:
- https://github.com/yosiat/quamina-flatenner/blob/main/paths.go - A tree structure for indexing the paths, so I can walk it while traversing objects in the flattener.
- https://github.com/yosiat/quamina-flatenner/blob/main/jx.go - The actual flattener, which uses jx. I copied some methods from Quamina, mostly to help with the ArrayTrail logic.
Update:
I pushed the sources to my side branch, so it will be easier to use.
- Checkout https://github.com/yosiat/quamina/tree/flat-hack
- Run go build (to install jx)
- To run benchmarks, I prefixed them with Test_JX - so go test -v -run "^Test_JX" will suffice.
Oops… the benchmark we most care about isn't in benchmarks_test (oops, sorry) it's TestCityLots in quamina_test.go. Very strongly dependent on flattener performance.
I have to go do other stuff for a while, will get back to this.
Ok, now it's getting a smaller difference -
=== RUN TestCityLots
Field matchers: 7 (avg size 1.500, max 3), Value matchers: 6, SmallTables 0 (avg size n/a, max 0), singletons 6
173,434.09 matches/second
--- PASS: TestCityLots (1.64s)
=== RUN TestCityLots_JX
Field matchers: 7 (avg size 1.500, max 3), Value matchers: 6, SmallTables 0 (avg size n/a, max 0), singletons 6
177,609.63 matches/second
And the profiles are the same, most of the time is on storeArrayElementField.
Yeah, the Patterns applied in that benchmark force the flattener to process the whole record, which includes some large-ish arrays of floating-point numbers, so it's pretty well a worst-case. But actually, there are two distinct problems:
- Making the flattener efficient when processing lots of fields
- Making the flattener smarter about skipping fields it doesn't need to look at (what you've been working on mostly)
OK, so once again, thanks for this work - once we've landed this PR I'll add you as a project contributor. Now, a confession: You've been doing so much work that I've sort of lost track of which PRs I should review and when. Feel free to send me a message, email or Signal, when you'd like me to take a close look at something.
OK, so once again, thanks for this work - once we've landed this PR I'll add you as a project contributor.
Thanks a lot!
I've sort of lost track of which PRs I should review and when
In terms of PRs, we have:
- #112 - fixing linter
- #109 - matchset.addX performance, once the linter fixing is merged I'll rebase and adapt
And then we need to decide on how to proceed with the flattener changes: My approach is clear - track list of paths used and flatten only them, but I am not sure how to proceed here:
- Tracking the list of paths is easy when we are considering additions only, but when we consider removals (of patterns) it becomes complex, because then we need to make sure to remove a path only when no remaining pattern uses it. I haven't looked at how deletions work currently, so I have no idea on this.
- My flattener uses Jx and you said that you don't want an external dependency, which makes sense. Doing this with the existing flattener will be complex; I can try to make it work but it will take some time to get my head around the existing parser.
So my work on the flattener will allow me to use an external one (my own) to continue conducting my tests in the internal project and see its e2e performance. But I am not sure it's ready to be something that we can merge/review.
Feel free to send me a message, email or Signal.
Thanks! I'll keep the discussions here in the open on GitHub, and if the need arises I'll send you an email (I have sent you an email, so you will have my private one as well).
We can probably close this issue, but maybe start another one on flattening, because there is useful discussion in here. I just had another idea: Instead of making NameTracker smarter, we could consider a Flattener implementation that took config options, for example "when you come to this field you can stop, nothing useful after it". Because it sounds like skipping parts of the event is only really useful when you have hand-crafted events and the caller has inside knowledge about the structure.
Closing this issue; as suggested, I opened a new one for flattening - #113.