matches(of:) and firstMatch(of:) behave differently when multiline is used with negative lookahead

Question

fwgreen opened this issue 2 years ago

import Foundation

let regex = /(?m)^\b.*(^(?!SERIAL|TT).+)/

let string = """
CONT NEG TESTS - ALL PINS

SERIAL NUMBER: 15      MODULE: 0          Mon Jan  1 03:41:42 2007
TT   STMT PIN MEAS VALUE      FORCING         LESS THAN  GREATER THAN
---- ---- --- --------------- --------------- ---------- ------------
"""

for match in string.matches(of: regex) {
    print(match.output.1)
}

print("")

print(string.firstMatch(of: regex)?.output.1 ?? "")

Expected output:

CONT NEG TESTS - ALL PINS

CONT NEG TESTS - ALL PINS

Actual output:

CONT NEG TESTS - ALL PINS
---- ---- --- --------------- --------------- ---------- ------------

CONT NEG TESTS - ALL PINS

Removing multiline fixes it, but this is only one capture group of a larger regex. With multiline enabled, the same pattern works as expected in both Java (on my local machine) and the PCRE2 flavor on Regex101.com.

Nate Cook · Answer 1 · Tue Nov 29 2022 00:04:02 GMT+0800 (China Standard Time)

This is caused by the default word-boundary algorithm used by Regex, which recognizes Unicode "default" word boundaries rather than the "simple" word boundaries used by other regular expression engines. In particular, the default algorithm always treats the start and end of a string, and the positions before and after line breaks, as word boundaries, whereas the simple algorithm treats them as word boundaries only when they are followed (or preceded, as relevant) by a "word character" (i.e. a character matched by \w). That is why the line of dashes matches in your example: the position at its start comes right after a line break, so \b succeeds there even though - is not a word character.

To get the behavior you're expecting, you can switch to simple boundaries by calling .wordBoundaryKind(.simple) on your regex:

let regex = /(?m)^\b.*(^(?!SERIAL|TT).+)/
    .wordBoundaryKind(.simple)
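
To see the difference side by side, here is a minimal sketch that runs the same pattern against the input from the question under both boundary kinds. The names report, defaultBoundaries and simpleBoundaries are only illustrative, and the #/.../# delimiters are my substitution so the snippet compiles even where bare-slash regex literals are not enabled; the pattern itself is unchanged.

```swift
import Foundation

// Input copied from the question.
let report = """
CONT NEG TESTS - ALL PINS

SERIAL NUMBER: 15      MODULE: 0          Mon Jan  1 03:41:42 2007
TT   STMT PIN MEAS VALUE      FORCING         LESS THAN  GREATER THAN
---- ---- --- --------------- --------------- ---------- ------------
"""

// Same pattern twice: once with Regex's default (Unicode) word boundaries,
// once switched to the "simple" boundaries other engines use.
let defaultBoundaries = #/(?m)^\b.*(^(?!SERIAL|TT).+)/#
let simpleBoundaries = defaultBoundaries.wordBoundaryKind(.simple)

// Collect capture group 1 from every match.
let defaultLines = report.matches(of: defaultBoundaries).map { String($0.output.1) }
let simpleLines = report.matches(of: simpleBoundaries).map { String($0.output.1) }

print("default:", defaultLines)
print("simple: ", simpleLines)
```

Going by the outputs reported above, the default run should include the line of dashes and the simple run should contain only the first line. Note that firstMatch(of:) returns just the first of these matches in either case, so it cannot show the extra line.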