apple / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.

[Benchmark] Excessive (and expensive) backtracking

milseman opened this issue

From the forums:

```swift
import Foundation

let pattern = "^ +A"
let regex1 = try! NSRegularExpression(pattern: pattern)
let regex2 = try! Regex(pattern)
let numberOfSpaces = 50000
let testString = String(repeating: " ", count: numberOfSpaces)

// Time NSRegularExpression.
var date = Date()
print(regex1.firstMatch(in: testString, range: .init(location: 0, length: numberOfSpaces)) == nil)
print(-date.timeIntervalSinceNow)

// Time Regex (reset the clock so the two measurements are independent).
date = Date()
print(try! regex2.firstMatch(in: testString) == nil)
print(-date.timeIntervalSinceNow)
```

@rctcwyvrn, can you add this to the benchmark suite you're working on? Lots of opportunities exposed here.

We don't have a compiler IR yet, so we don't want to layer on excess complexity while the execution model is still settling down. But there's a ton of low-hanging fruit here.


Processor.consume() calculates the remaining size of the input slice on every call. Switching to index(offsetBy:limitedBy:) removes that extra linear measurement, which is what was creating the quadratic factor. This change gave me a very easy 7x improvement on this specific benchmark (locally, via a package build), and the gain only grows with input length.
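Here's a minimal before/after sketch, assuming a toy processor that tracks a current position into its input (illustrative names, not the actual Processor internals):

```swift
// Illustrative only: why measuring the slice on every consume is
// quadratic, and how index(offsetBy:limitedBy:) avoids it.
struct ToyProcessor {
  var input: Substring
  var currentPosition: Substring.Index

  // Before: distance(from:to:) walks the rest of the input, an O(n)
  // measurement on every call, so n calls cost O(n^2) overall.
  mutating func consumeMeasuring(_ n: Int) -> Bool {
    guard input.distance(from: currentPosition, to: input.endIndex) >= n
    else { return false }
    input.formIndex(&currentPosition, offsetBy: n)
    return true
  }

  // After: index(offsetBy:limitedBy:) advances at most n positions and
  // fails cheaply at the boundary, with no separate measurement.
  mutating func consume(_ n: Int) -> Bool {
    guard let next = input.index(
      currentPosition, offsetBy: n, limitedBy: input.endIndex)
    else { return false }
    currentPosition = next
    return true
  }
}
```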

@Azoy, can you try to spin up a PR incorporating this patch and make sure we're testing the boundaries correctly?


firstMatch is needlessly spinning up a new executor and processor for every single input position, only to fail right away on the anchor. We should integrate first-match with the processor so it can just reset its state.
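A hypothetical sketch of the shape this could take (the names and structure are illustrative, not the engine's actual Executor/Processor API):

```swift
// Hypothetical: construct the engine once, then reset per start
// position instead of reallocating registers, stacks, etc.
struct ToyEngine {
  let input: Substring
  var currentPosition: Substring.Index
  // Registers, the save-point stack, etc. would live here too.

  init(_ input: Substring) {
    self.input = input
    self.currentPosition = input.startIndex
  }

  // Reset to a new start position without reallocating any storage.
  mutating func reset(to position: Substring.Index) {
    currentPosition = position
  }

  // Stand-in for executing the compiled program from currentPosition.
  mutating func run() -> Range<Substring.Index>? { nil }

  // The first-match loop: one engine, reset per start position.
  mutating func firstMatch() -> Range<Substring.Index>? {
    var start = input.startIndex
    while true {
      reset(to: start)                      // reuse, don't reallocate
      if let match = run() { return match }
      guard start < input.endIndex else { return nil }
      input.formIndex(after: &start)
    }
  }
}
```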

Reusing the same executor gives an additional 10% improvement on top of the 7x above. It still creates a new processor every time, though; avoiding that would give us roughly another 30%.

@rctcwyvrn, can you work with me on integrating it? Can you also work with me on crafting targeted benchmarks that demonstrate these wins prominently, as well as on reusing the same processor?


After that, we're getting absolutely killed by ARC, which accounts for ~70% of the remaining time after the above improvements.

Part of this is that we store Characters to compare against in registers, when we should instead store trivial values where possible. There are a few ways to go about this, and I'll try to think of a simple one that's unlikely to conflict with future directions. This will also speed up creating a new processor.
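One possible shape, purely illustrative: keep a trivial payload for the common single-scalar case so register copies never touch a refcount, and fall back to Character only when we must.

```swift
// Illustrative option: Character can heap-allocate for larger
// graphemes, so copying it in and out of registers triggers
// retain/release. A Unicode.Scalar payload carries no references,
// so copying the .scalar case is ARC-free.
enum MatchTarget {
  case scalar(Unicode.Scalar)   // trivial payload
  case character(Character)     // may reference-count backing storage

  init(_ c: Character) {
    let scalars = c.unicodeScalars
    if scalars.count == 1, let s = scalars.first {
      self = .scalar(s)
    } else {
      self = .character(c)
    }
  }
}
```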

We still end up ARCing the executor after the above fixes, and somehow Processor.matchSeq heavily ARCs itself. We also heavily ARC MEProgram and others.

And there's a lot of dumb ARC going on for things like loading regex.initialOptions.semanticLevel inside inner loops, which we can hoist or code around.
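A toy illustration of the hoist (ToyOptions is a class here just to force the refcounting; the real types differ):

```swift
// Reading a property through retained storage inside a hot loop can
// retain/release on each iteration. Loading the (trivial) value into
// a local once moves that traffic outside the loop.
final class ToyOptions {
  let semanticLevel: Int
  init(semanticLevel: Int) { self.semanticLevel = semanticLevel }
}

struct ToyRegex {
  let initialOptions: ToyOptions
}

func countAtLevel(_ regex: ToyRegex, _ input: [Int]) -> Int {
  // Hoisted: at most one retain/release here, none per iteration.
  let level = regex.initialOptions.semanticLevel
  var matches = 0
  for value in input where value == level {
    matches += 1
  }
  return matches
}
```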

@rctcwyvrn, would you be interested in tracking these down with me and fixing them?


Intermediate-term, we should peephole-optimize a single-scalar, normalization-invariant Character match into a consumeScalar instruction (with the scalar value encoded in the instruction's payload) followed by a grapheme-boundary check. This is broadly useful and should improve pretty much every regex with verbatim components in practice.
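At match time, the peepholed pair would behave roughly like this sketch (an illustrative function, not the actual instruction set):

```swift
// Illustrative semantics of consumeScalar + boundary check: compare a
// single scalar (no grapheme breaking up front), then verify that the
// new position sits on a grapheme-cluster boundary.
func consumeScalar(
  _ expected: Unicode.Scalar,
  in input: String,
  at position: inout String.Index
) -> Bool {
  let scalars = input.unicodeScalars
  guard position < scalars.endIndex, scalars[position] == expected
  else { return false }
  let next = scalars.index(after: position)
  // String.Index(_:within:) yields nil when the position doesn't fall
  // on a Character boundary of the target string, which is exactly
  // the boundary check we need here.
  guard String.Index(next, within: input) != nil else { return false }
  position = next
  return true
}
```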

@Azoy and @natecook1000, can you think through where we'd need to check for grapheme-cluster boundaries? Where or how could those checks be elided? And how could we make them faster?

Similarly for verbatim sequences.


We should explore instructions for simple repetitions, especially those around a single-scalar Character. This benchmark doesn't yet demonstrate the need, but hopefully after tackling the other low-hanging fruit this becomes more salient.
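Roughly, such an instruction would run a tight scan instead of going through the general instruction cycle (and a save point) once per element, as in this sketch:

```swift
// Illustrative: a fused "consume while scalar == expected" loop. One
// instruction dispatch instead of one cycle per repetition element.
// On ASCII-only input, the per-element grapheme-boundary check is
// trivially true, so this is a plain scalar scan.
func consumeWhile(
  _ expected: Unicode.Scalar,
  in input: String,
  from start: String.Index
) -> String.Index {
  let scalars = input.unicodeScalars
  var i = start
  while i < scalars.endIndex, scalars[i] == expected {
    scalars.formIndex(after: &i)
  }
  return i  // greedy end; backtracking may re-enter within start..<i
}
```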

We can also exploit the fact that grapheme breaking for . or any newline-excluding class is trivial on ASCII-only inputs.

Similarly, we need much cheaper save point representations and backtracking.
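One cheaper representation, sketched under the assumption of a greedy single-element repetition like the scan above: a single range save point instead of one stack entry per consumed element.

```swift
// Illustrative range save point: record [low, high) once and pop
// positions off the high end on failure, rather than pushing a full
// save point per repetition element.
struct RangeSavePoint {
  let input: Substring
  let low: Substring.Index     // earliest position worth retrying
  var high: Substring.Index    // current (just-failed) position

  // Back off one position at a time; nil once the range is exhausted.
  mutating func backtrack() -> Substring.Index? {
    guard high > low else { return nil }
    high = input.index(before: high)
    return high
  }
}
```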


At some point we should switch to prioritizing contiguous, validly encoded UTF-8. That is, we branch once at the top rather than during every read. This will require some more design and architecture work.
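The shape of the idea, sketched with withUTF8 (which may copy a non-contiguous string up front):

```swift
// Illustrative: branch once on the representation, then read raw,
// validly encoded UTF-8 bytes in the hot loop. withUTF8 is mutating
// because it may copy a non-contiguous (e.g. bridged) string first.
func countLeadingSpaces(_ s: String) -> Int {
  var s = s
  return s.withUTF8 { bytes in
    var count = 0
    while count < bytes.count, bytes[count] == UInt8(ascii: " ") {
      count += 1
    }
    return count
  }
}
```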

Currently, the low-hanging fruit will drown out the improvements here and this will add complexity, so it makes sense to wait until we tackle those other issues.


After we have an IR, we should do post-dominating terminal analysis so that we can replace consume loops with scanning, eliding needless save points and backtracking.
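For this report's pattern, the analysis would prove the save points dead: every position the + could backtrack to holds a space, which can never match the post-dominating A, so ^ +A lowers to a plain scan. A sketch:

```swift
// Illustrative lowering of "^ +A" after the analysis: scan the run of
// spaces, then require A immediately. No save points, no backtracking.
func matchSpacesThenA(_ input: String) -> Range<String.Index>? {
  let scalars = input.unicodeScalars
  var i = scalars.startIndex
  while i < scalars.endIndex, scalars[i] == " " {
    scalars.formIndex(after: &i)
  }
  guard i > scalars.startIndex,                  // "+": at least one space
        i < scalars.endIndex, scalars[i] == "A"  // terminal must follow
  else { return nil }
  return scalars.startIndex..<scalars.index(after: i)
}
```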

Similarly for eliding grapheme boundary checks and save points. It makes sense to wait until we settle our execution model a little before doing this.

Similarly for anchor analysis and exiting early for first-match style operations.

20x improvement to this micro (nano?) benchmark we extracted from this report: #494

A compounding 2x in #495

A compounding 2-7x, depending, in #497