matches(of:) and firstMatch(of:) behave differently when multiline is used with negative lookahead
fwgreen opened this issue
import Foundation
let regex = /(?m)^\b.*(^(?!SERIAL|TT).+)/
let string = """
CONT NEG TESTS - ALL PINS
SERIAL NUMBER: 15 MODULE: 0 Mon Jan 1 03:41:42 2007
TT STMT PIN MEAS VALUE FORCING LESS THAN GREATER THAN
---- ---- --- --------------- --------------- ---------- ------------
"""
for match in string.matches(of: regex) {
    print(match.output.1)
}
print("")
print(string.firstMatch(of: regex)?.output.1 ?? "")
Expected output:
CONT NEG TESTS - ALL PINS

CONT NEG TESTS - ALL PINS
Actual output:
CONT NEG TESTS - ALL PINS
---- ---- --- --------------- --------------- ---------- ------------

CONT NEG TESTS - ALL PINS
Removing multiline fixes it, but this is only one capture group of a larger regex, and the pattern works as expected, with multiline, in both Java (on my local machine) and the PCRE2 flavor on Regex101.com.
This is caused by the default word-boundary algorithm used by Regex, which recognizes Unicode "default" word boundaries instead of the "simple" word boundaries used by other regular expression engines. In particular, the default algorithm always treats the positions at the start and end of a string, and before and after line breaks, as word boundaries, whereas the simple algorithm treats them as word boundaries only when they are followed (or preceded, as relevant) by a "word character" (i.e. a character that is matched by \w). In your input, the line of dashes starts right after a line break, so ^\b matches there under the default algorithm even though - is not a word character, and matches(of:) returns that line as a second match.
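As a rough illustration of the difference, here is a minimal sketch that runs the same pattern in both modes. The sample string and variable names are made up for this example, and the literal uses the `#/…/#` delimiters so that it compiles without the bare-slash regex flag.

```swift
// Hypothetical three-line sample: the middle line starts with a non-word character.
let sample = "abc\n---\nxyz"

// Match any line that begins at a word boundary.
let lineStart = #/(?m)^\b.+/#

// Default (Unicode) boundaries: per the explanation above, the position right
// after a line break counts as a boundary, so the dashed line is expected to match.
let defaultLines = sample.matches(of: lineStart).map { String($0.output) }

// Simple boundaries: \b needs a word character on exactly one side,
// so the line starting with "-" is skipped.
let simpleLines = sample
    .matches(of: lineStart.wordBoundaryKind(.simple))
    .map { String($0.output) }

print(defaultLines)
print(simpleLines)
```

With simple boundaries only the lines beginning with a word character should be printed, which is the behavior Java and PCRE2 give for the same pattern.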
To get the behavior you're expecting, you can switch to simple boundaries by calling .wordBoundaryKind(.simple) on your regex:
let regex = /(?m)^\b.*(^(?!SERIAL|TT).+)/
    .wordBoundaryKind(.simple)
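To check that the two APIs then agree, something like the following sketch can be run against the input from the report. The names `report`, `simpleRegex`, `allCaptures`, and `firstCapture` are illustrative, and the literal is written with `#/…/#` delimiters only so it builds without the bare-slash regex flag.

```swift
let report = """
CONT NEG TESTS - ALL PINS
SERIAL NUMBER: 15 MODULE: 0 Mon Jan 1 03:41:42 2007
TT STMT PIN MEAS VALUE FORCING LESS THAN GREATER THAN
---- ---- --- --------------- --------------- ---------- ------------
"""

// Same pattern as in the issue, switched to simple word boundaries.
let simpleRegex = #/(?m)^\b.*(^(?!SERIAL|TT).+)/#
    .wordBoundaryKind(.simple)

// Collect capture group 1 from every match, and from the first match only.
let allCaptures = report.matches(of: simpleRegex).map { String($0.output.1) }
let firstCapture = report.firstMatch(of: simpleRegex).map { String($0.output.1) }

print(allCaptures)
print(firstCapture ?? "")
```

Under simple boundaries the dashed line no longer starts at a word boundary, so both calls should report only the first line.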