Null alternation causes infinite loop
rctcwyvrn opened this issue · comments
The first regex runs instantly as expected, but the second one appears to run forever. The first has @ in the alternation, while the second has it outside as an anchor.
Running them on regex101 shows that its engines run the two regexes in a similar number of steps and produce the same matches.
func testInfiniteLoop() {
    print("Starting regex")
    // First regex: @ is inside the alternation; finishes instantly.
    // let regex = try! Regex(#"(?:\d|\w|\.|-|_|%|\+|@)+"#)
    // Second regex: @ is outside the alternation; never finishes.
    let regex = try! Regex(#"(?:\d|\w|\.|-|_|%|\+|)+@(?:\d|\w|\.|-|_|%|\+|)+"#)
    print(try! regex.firstMatch(in: "lily.test.email@test.com"))
    print("Finished regex")
}
There was a typo in my regex: the last alternative in (?:\d|\w|\.|-|_|%|\+|) is empty, and that seems to be the issue.
This regex runs correctly: #"(?:\d|\w|\.|-|_|%|\+)+@(?:\d|\w|\.|-|_|%|\+)+"#
This regex loops forever: #"(\w|)+"#
The loop is caused by DSLTree.Node.empty emitting no instructions, so the alternation's save point jumps to the instruction after the alternation instead of to the second alternative.
More precisely, what we currently do is:
For each alternative except the last one:
- Emit a save point to the next alternative
- Emit instructions for the current alternative
- Branch to end
Then for the last alternative we just emit its instructions (if it fails, the entire alternation fails; if it succeeds, we fall through to the next block).
However, if the last alternative is an empty node, it emits no instructions, so the previous alternative's save point doesn't point at the final alternative but at the first instruction AFTER the alternation. In the example regex this ends up being a splitSaving instruction emitted for the quantification, which is intended to eagerly match as much as possible.
So what happens is that the first alternative (\w) fails and tries to jump to the last alternative (.empty), but instead it jumps to the splitSaving instruction, which is normally only reached if matching was successful, so it tries to match again and loops forever.
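To make the address collision described above concrete, here is a minimal sketch of the emission scheme. It is a toy model, not the real ByteCodeGen: Node, Inst, and emitAlternation are hypothetical names, and the instruction set is reduced to the three operations mentioned above.

```swift
// Toy model with hypothetical names; not the real ByteCodeGen.
enum Node {
    case char(Character)
    case empty
}

enum Inst: Equatable {
    case save(Int)        // push a save point that resumes at this address
    case match(Character) // consume one character or fail
    case branch(Int)      // unconditional jump
}

/// Emits an alternation with the scheme described above. Returns the program
/// and the address of the first instruction after the alternation.
func emitAlternation(_ children: [Node]) -> (program: [Inst], end: Int) {
    var program: [Inst] = []
    var pendingBranches: [Int] = [] // branch instructions to patch with `end`

    func emit(_ node: Node) {
        // `.empty` emits nothing, mirroring the behaviour described above.
        if case .char(let c) = node { program.append(.match(c)) }
    }

    for child in children.dropLast() {
        let saveIndex = program.count
        program.append(.save(-1))     // placeholder, patched below
        emit(child)
        pendingBranches.append(program.count)
        program.append(.branch(-1))   // placeholder, patched below
        // The save point resumes at the start of the next alternative.
        program[saveIndex] = .save(program.count)
    }
    if let last = children.last { emit(last) }

    let end = program.count
    for index in pendingBranches { program[index] = .branch(end) }
    return (program, end)
}

// With an empty last alternative, the save point targets the address right
// after the alternation, which is where the quantifier's instruction would sit.
let withEmpty = emitAlternation([.char("a"), .empty])
// With a non-empty last alternative, it targets that alternative instead.
let withoutEmpty = emitAlternation([.char("a"), .char("b")])
```

In this model withEmpty.program[0] is a save to withEmpty.end, while withoutEmpty.program[0] is a save to an address strictly before withoutEmpty.end.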
The solution is simple: filter out any .empty nodes from the alternation before we emit them.
There are many other edge cases that result in the same issue as this empty alternative:
- (?:\w|(?i-i:))+
  The last alternative is a non-capturing group that only changes matching options and whose inner node is .empty
- (?:\w|(?#comment))+
  The last alternative is a .trivia
- (?:\w|(?#comment)(?i-i:))+
  The last alternative is a concatenation that only contains nodes that don't emit instructions
- (?:\w|(?i))+
  The last alternative is a .changeMatchingOptions Atom
The most efficient approach would be to emit nothing for nodes that don't produce any instructions. However, that means we have to make sure this emitsInstructions property on DSLTree.Node stays in sync with any changes we make to ByteCodeGen:
var emitsInstructions: Bool {
    switch self {
    case .trivia(_), .empty: return false
    case .orderedChoice(let nodes):
        return nodes.any { $0.emitsInstructions }
    case .concatenation(let nodes):
        return nodes.any { $0.emitsInstructions }
    case .convertedRegexLiteral(let node, _):
        return node.emitsInstructions
    case .nonCapturingGroup(let kind, let child):
        switch kind.ast {
        case .lookahead, .negativeLookahead, .lookbehind, .negativeLookbehind,
             .capture, .namedCapture, .balancedCapture, .atomicNonCapturing:
            return true
        case .changeMatchingOptions: return child.emitsInstructions
        // Note: emitNoncapturingGroup has a fixme for this case
        default: return child.emitsInstructions
        }
    case .atom(let atom):
        switch atom {
        case .changeMatchingOptions: return false
        default: return true
        }
    default: return true
    }
}
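// When emitting an alternation, skip children that emit no instructions.
let nodes = children.filter { $0.emitsInstructions }
if nodes.isEmpty {
    // Nothing to emit at all.
    return
}
if nodes.count == 1 {
    // A single remaining alternative needs no save points or branches.
    try emitNode(nodes.first!)
    return
}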
The alternative is to just make .trivia and .empty emit a nop. This would mean that emitNode always emits at least one instruction, so we don't have to worry about all the edge cases (I think). I think being slightly less efficient here is a worthwhile tradeoff.
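As a rough illustration of the nop idea, the sketch below emits an alternation where an empty alternative produces a placeholder instruction. It is a toy model with hypothetical names (ToyNode, ToyInst, emitWithNop), not the real ByteCodeGen, and it only shows that the save point then targets a real instruction inside the alternation. It says nothing about whether the quantifier loop terminates.

```swift
// Toy model with hypothetical names; not the real ByteCodeGen.
enum ToyNode {
    case char(Character)
    case empty
}

enum ToyInst: Equatable {
    case save(Int)        // push a save point that resumes at this address
    case match(Character) // consume one character or fail
    case branch(Int)      // unconditional jump
    case nop              // does nothing, but occupies an address
}

func emitWithNop(_ children: [ToyNode]) -> (program: [ToyInst], end: Int) {
    var program: [ToyInst] = []
    var pendingBranches: [Int] = []

    func emit(_ node: ToyNode) {
        switch node {
        case .char(let c): program.append(.match(c))
        case .empty: program.append(.nop) // always at least one instruction
        }
    }

    for child in children.dropLast() {
        let saveIndex = program.count
        program.append(.save(-1))     // placeholder, patched below
        emit(child)
        pendingBranches.append(program.count)
        program.append(.branch(-1))   // placeholder, patched below
        program[saveIndex] = .save(program.count)
    }
    if let last = children.last { emit(last) }

    let end = program.count
    for index in pendingBranches { program[index] = .branch(end) }
    return (program, end)
}

// The empty alternative now has its own address (3), distinct from `end` (4).
let nopResult = emitWithNop([.char("a"), .empty])
```

The cost in this model is one extra instruction per empty alternative, which matches the "slightly less efficient" tradeoff mentioned above.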
This is actually a more fundamental issue with how we do quantification, not with the lack of emitted instructions.
(\b)+ should match every boundary in the string, but the way we emit quantifiers doesn't handle cases where we match the subexpression without advancing the input. In those cases we need to recognize that we're doing a zero-length match and branch out of the quantification loop.
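A minimal sketch of that zero-length check, assuming a simplified greedy loop with no backtracking. matchOneOrMore and wordOrEmpty are hypothetical helpers for illustration, not part of the engine; the body closure stands in for the quantified subexpression and returns the position after a successful match, or nil on failure.

```swift
/// Greedily applies `body` one or more times starting at `start`.
/// Returns the final position, or nil if the first iteration fails.
func matchOneOrMore(
    in input: String,
    from start: String.Index,
    body: (String, String.Index) -> String.Index?
) -> String.Index? {
    guard var position = body(input, start) else { return nil }
    while let next = body(input, position) {
        // Zero-length iteration: the body matched without consuming input,
        // so repeating it can never make progress. Leave the loop.
        if next == position { break }
        position = next
    }
    return position
}

// Stand-in for the subexpression of (\w|)+ : a word character, or else the
// empty alternative, which always succeeds without consuming input.
let wordOrEmpty: (String, String.Index) -> String.Index? = { input, i in
    if i < input.endIndex,
       input[i].isLetter || input[i].isNumber || input[i] == "_" {
        return input.index(after: i)
    }
    return i
}

let sample = "ab!"
let loopEnd = matchOneOrMore(in: sample, from: sample.startIndex, body: wordOrEmpty)
```

Without the next == position check, the loop above would spin forever at the "!" because the empty alternative keeps succeeding at the same position.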