apple / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.

Null alternation causes infinite loop

rctcwyvrn opened this issue · comments

commented

The first regex runs instantly as expected, but the second one appears to run forever. The first has @ in the alternation while the second has it outside as an anchor.

Running them on regex101 shows that their engines run the two regexes in a similar number of cycles and match the same results.

  func testInfiniteLoop() {
    print("Starting regex")
    // let regex = try! Regex(#"(?:\d|\w|\.|-|_|%|\+|@)+"#)
    let regex = try! Regex(#"(?:\d|\w|\.|-|_|%|\+|)+@(?:\d|\w|\.|-|_|%|\+|)+"#)
    print(try! regex.firstMatch(in: "lily.test.email@test.com"))
    print("Finished regex")
  }
commented

There was a typo in my regex: the last alternative in (?:\d|\w|\.|-|_|%|\+|) is empty, and that seems to be the issue.

This regex runs correctly: #"(?:\d|\w|\.|-|_|%|\+)+@(?:\d|\w|\.|-|_|%|\+)+"#

This regex loops forever: #"(\w|)+"#

commented

The loop is caused by DSLTree.Node.empty emitting no instructions, resulting in the alternation instruction jumping to the instruction after the alternation instead of the second alternative.

More precisely, what we currently do is

For each alternative except the last one:

  1. Emit a save point to the next alternative
  2. Emit instructions for the current alternative
  3. Branch to end

Then, for the last alternative, just emit its instructions. If it fails, the entire alternation fails; if it succeeds, execution falls through to the next block.

However, if the last alternative is an empty node, it emits no instructions, so the previous alternative's save point does not point at the final alternative but at the first instruction after the whole alternation. In the example regex, that instruction is a splitSaving emitted for the quantifier, which is intended to eagerly match as many repetitions as possible.

So when the first alternative (\w) fails, the engine tries to jump to the last alternative (.empty) but lands on the splitSaving instruction, which is normally reached only after a successful match. It therefore tries to match again and loops forever.
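
The address coincidence can be reproduced in a small toy model. This is only a sketch: `Instr` and `emitAlternation` are hypothetical names invented for illustration, not the engine's real instruction set or code generator, and each alternative is represented directly by the instructions it would emit.

```swift
// Toy model; `Instr` and `emitAlternation` are hypothetical names,
// not the engine's actual instruction set or code generator.
enum Instr: Equatable {
  case save(Int)     // push a save point that resumes at this address
  case match(String) // stand-in for the instructions of one alternative
  case branch(Int)   // unconditional jump to this address
}

// Each alternative is the list of instructions it would emit;
// an empty list models `.empty`.
func emitAlternation(_ alternatives: [[Instr]]) -> [Instr] {
  var program: [Instr] = []
  var branchSlots: [Int] = []
  for alternative in alternatives.dropLast() {
    let saveSlot = program.count
    program.append(.save(-1))                 // save point, patched below
    program.append(contentsOf: alternative)   // the alternative itself
    branchSlots.append(program.count)
    program.append(.branch(-1))               // branch to end, patched below
    program[saveSlot] = .save(program.count)  // where the next alternative starts
  }
  // The last alternative is emitted as-is, with no save point or branch.
  program.append(contentsOf: alternatives.last ?? [])
  let end = program.count
  for slot in branchSlots { program[slot] = .branch(end) }
  return program
}

// Models (\w|): the second alternative emits nothing.
let program = emitAlternation([[.match("\\w")], []])
let endAddress = program.count
print(program, endAddress)
```

In this model the save point in slot 0 targets `endAddress`, the first address after the alternation, because the empty last alternative occupies no address of its own. Whatever instruction follows the alternation (the quantifier's loop instruction in the issue) is what the save point resumes at.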

The solution is simple: filter out any .empty nodes from the alternation before we emit them.

commented

There are many other edge cases that result in the same issue as this empty alternative:

  • (?:\w|(?i-i:))+ last alternative is a non-capturing group that only changes matching options and whose inner node is .empty
  • (?:\w|(?#comment))+ last alternative is a trivia
  • (?:\w|(?#comment)(?i-i:))+ last alternative is a concatenation that only contains nodes that don't emit instructions
  • (?:\w|(?i))+ last alternative is a .changeMatchingOptions Atom

The most efficient fix would be to emit nothing for nodes that don't produce any instructions. However, this means we have to keep this emitsInstructions property on DSLTree.Node in sync with any changes we make to ByteCodeGen:

  var emitsInstructions: Bool {
    switch self {
    case .trivia(_), .empty: return false
    case .orderedChoice(let nodes):
      return nodes.any { $0.emitsInstructions }
    case .concatenation(let nodes):
      return nodes.any { $0.emitsInstructions }
    case .convertedRegexLiteral(let node, _):
      return node.emitsInstructions
    case .nonCapturingGroup(let kind, let child):
      switch kind.ast {
      case .lookahead, .negativeLookahead, .lookbehind, .negativeLookbehind,
          .capture, .namedCapture, .balancedCapture, .atomicNonCapturing:
        return true
      case .changeMatchingOptions: return child.emitsInstructions
        // Note: emitNoncapturingGroup has a fixme for this case
      default: return child.emitsInstructions
      }
    case .atom(let atom):
      switch atom {
      case .changeMatchingOptions: return false
      default: return true
      }
    default: return true
    }
  }
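
    // In the alternation emitter: keep only alternatives that emit instructions.
    let nodes = children.filter { $0.emitsInstructions }
    guard let first = nodes.first else {
      return // nothing to emit
    }
    if nodes.count == 1 {
      try emitNode(first)
      return
    }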

The alternative is to make .trivia and .empty emit a nop. emitNode would then always emit at least one instruction, so we wouldn't have to worry about all these edge cases (I think). Being slightly less efficient here seems like a worthwhile tradeoff.
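
A rough sketch of the nop idea, again in a toy model rather than the real ByteCodeGen: `Op`, `ToyNode`, `emitToyNode`, and `emitChoice` are hypothetical names used only for illustration. The point it shows is that once every node emits at least one instruction, each alternative has its own start address for the save point to target.

```swift
// Toy model; all names here are hypothetical, not the engine's API.
enum Op: Equatable {
  case save(Int), match(String), branch(Int), nop
}

enum ToyNode {
  case word   // stands in for \w
  case empty  // stands in for .empty / .trivia
}

// Every node emits at least one instruction; empty nodes emit a nop.
func emitToyNode(_ node: ToyNode) -> [Op] {
  switch node {
  case .word: return [.match("\\w")]
  case .empty: return [.nop]
  }
}

// Same scheme as described earlier in the thread: save point, alternative,
// branch to end for all but the last alternative.
func emitChoice(_ nodes: [ToyNode]) -> [Op] {
  var ops: [Op] = []
  var branchSlots: [Int] = []
  for node in nodes.dropLast() {
    let saveSlot = ops.count
    ops.append(.save(-1))
    ops.append(contentsOf: emitToyNode(node))
    branchSlots.append(ops.count)
    ops.append(.branch(-1))
    ops[saveSlot] = .save(ops.count)  // start of the next alternative
  }
  if let last = nodes.last { ops.append(contentsOf: emitToyNode(last)) }
  let end = ops.count
  for slot in branchSlots { ops[slot] = .branch(end) }
  return ops
}

// Models (\w|) with the nop in place of the empty alternative.
let ops = emitChoice([.word, .empty])
print(ops)
```

Here the save point targets the nop at address 3, while the branch targets address 4, so the two no longer coincide. The cost is one extra instruction per empty alternative.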

commented

This is actually a more fundamental issue with how we do our quantification, not the lack of emitting instructions.

(\b)+ should match every boundary in the string. However, the way we emit quantifiers doesn't handle the case where the subexpression matches without advancing the input. In that case we need to recognize that we made a zero-length match and branch out of the quantification loop.
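
A minimal sketch of such a zero-length check, assuming a much simpler matcher than the real engine: a greedy one-or-more loop with no backtracking, where a sub-pattern is just a function from a position to the position after its match. `SubMatcher`, `wordChar`, `wordOrEmpty`, and `matchOneOrMore` are hypothetical names for illustration.

```swift
// A sub-pattern takes the input and a position and returns the position
// after its match, or nil if it fails. (Hypothetical toy representation.)
typealias SubMatcher = ([Character], Int) -> Int?

// Stand-in for \w: consumes one letter, digit, or underscore.
func wordChar(_ input: [Character], _ i: Int) -> Int? {
  guard i < input.count,
        input[i].isLetter || input[i].isNumber || input[i] == "_"
  else { return nil }
  return i + 1
}

// Stand-in for (\w|): falls back to a zero-length match.
func wordOrEmpty(_ input: [Character], _ i: Int) -> Int? {
  return wordChar(input, i) ?? i
}

// Greedy `+` loop that leaves the loop when an iteration matches
// without advancing the input.
func matchOneOrMore(_ sub: SubMatcher, in input: [Character], from start: Int) -> Int? {
  guard var position = sub(input, start) else { return nil }
  while let next = sub(input, position) {
    if next == position { break }  // zero-length iteration: stop looping
    position = next
  }
  return position
}

let endOfWord = matchOneOrMore(wordOrEmpty, in: Array("ab!"), from: 0)
let endOfEmpty = matchOneOrMore(wordOrEmpty, in: Array("!"), from: 0)
print(endOfWord as Any, endOfEmpty as Any)
```

Without the `next == position` check, the second call would never terminate, since the empty fallback succeeds at the same position on every iteration. With the check, the cases that emit no instructions no longer need special handling in this toy loop.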