swiftlang / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.

Regex character classes fail with single scalar, non-NFC range bound elements

kasei opened this issue

Using a non-NFC, single-scalar code point like U+F900 as the start of a character class range causes an error:

1 | let r = #/[\u{F900}-\u{FDCF}]/#
  |           `- error: cannot parse regular expression: invalid bound for character class range

Tested with Swift 5.10 and 6.0 (Xcode 16b2 16A5171r):

swift-driver version: 1.90.11.1 Apple Swift version 5.10 (swiftlang-5.10.0.13 clang-1500.3.9.4)
Target: arm64-apple-macosx14.0

swift-driver version: 1.110 Apple Swift version 6.0 (swiftlang-6.0.0.4.52 clang-1600.0.21.1.3)
Target: arm64-apple-macosx14.0

This seems to be because U+F900 is not in NFC; it normalizes to U+8C48. I find this surprising: although the code point is not in NFC, this character class range isn't ambiguous in the way other non-NFC cases might be (e.g. using a decomposed sequence, or writing U+F900 as a literal instead of with the \u escape).
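To illustrate the normalization behavior described above, here is a minimal sketch using only the standard library. It relies on Swift's `String` equality being based on canonical equivalence; the variable names are illustrative only, and this does not show what the regex parser does internally.

```swift
// U+F900 (a CJK compatibility ideograph) canonically decomposes to the
// single scalar U+8C48, so it is not preserved by NFC normalization.
let compat = "\u{F900}"
let unified = "\u{8C48}"

// String comparison uses canonical equivalence, so the two strings
// are compared after normalization...
let stringsEqual = (compat == unified)

// ...while the unicodeScalars views expose the original, un-normalized scalars.
let scalarsEqual = compat.unicodeScalars.elementsEqual(unified.unicodeScalars)

print(stringsEqual, scalarsEqual)
```

The scalar written with the `\u` escape is unambiguous at the scalar level, which is why rejecting it as a range bound is unexpected.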

I am trying to port older code that uses NSRegularExpression, and this seems to be a blocker to moving away from the old APIs (short of expanding ranges like this into non-range classes of thousands of individual scalars).
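Until the range can be expressed in a regex literal, one possible stopgap is to test scalar values directly, outside of `Regex`. The sketch below assumes scalar-level matching is what the original `NSRegularExpression` pattern intended; `containsScalar(in:_:)` and `compatRange` are hypothetical names, and this is not a drop-in replacement for a full pattern.

```swift
// The same bounds as the failing character class, expressed as a
// range of Unicode scalars rather than as a regex.
let compatRange: ClosedRange<Unicode.Scalar> = "\u{F900}"..."\u{FDCF}"

// Hypothetical helper: returns true if any scalar in `text` lies in `range`.
// It walks the unicodeScalars view, so no normalization is applied.
func containsScalar(in range: ClosedRange<Unicode.Scalar>, _ text: String) -> Bool {
    text.unicodeScalars.contains { range.contains($0) }
}

print(containsScalar(in: compatRange, "a\u{F900}b"))
print(containsScalar(in: compatRange, "a\u{8C48}b"))
```

Because the check runs on scalars, U+F900 and its NFC form U+8C48 are treated as distinct, which differs from `Character`-level comparison.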