apple / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.

Regex with positive lookahead crashes at runtime when accessing match.output

AndreasVerhoeven opened this issue

Description

Using a regex with a positive lookahead sometimes crashes at runtime. See the example under Reproduction.

Reproduction

let regex = /(?=([1-9]|(a|b)))/
let input = "Something 9a"
let matches = input.matches(of: regex)
for match in matches {
	print(match.output) // accessing `.output` here crashes at runtime: Thread 1: EXC_BREAKPOINT (code=1, subcode=0x225246848)
}

Stack dump

Thread 1: EXC_BREAKPOINT (code=1, subcode=0x225246848)

Expected behavior

No crash

Environment

swift-driver version: 1.87.1 Apple Swift version 5.9 (swiftlang-5.9.0.128.108 clang-1500.0.40.1)
Target: arm64-apple-macosx14.0

Additional information

No response

This also reproduces with just /(?=(9))/ for the regex.

I don't have a fix yet, but I found the cause. The issue appears to stem from how a positive lookahead is implemented:

      ...
0:    save(restoringAt: success)
1:    save(restoringAt: intercept)
2:    <sub-pattern>    // any failure restores at 'intercept'
3:    clearThrough(intercept) // remove intercept and any leftovers from <sub-pattern>
4:    fail             // -> 'success'
5:  intercept:
6:    clearSavePoint   // remove 'success' restore point 
7:    fail             // propagate failure
8:  success:
      ...

The fail at (4) is the success path through the lookahead: that instruction restores the position (and other state) to where it was at the start of the lookahead pattern, then resumes from the save point created at (0), which sends the instruction pointer to (8), where pattern matching continues. Unfortunately, the state restoration in the fail also resets the capture group information, erasing any capture data that was saved while matching the lookahead pattern.

When you try to access the match output, the loss of that capture data causes a runtime failure, since any successful match must have both the overall range (empty in this case) and the capture formed during the lookahead (which is just the 9 in this simplified regex).
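
To make the state loss concrete, here is a minimal sketch of the mechanism described above. It is a toy model, not the engine's actual code: `MatcherState`, `SavePoint`, and the string labels are hypothetical names invented for illustration, and the `intercept` save point from (1) and the `clearThrough` at (3) are left out for brevity. It assumes that a save point snapshots the captures together with the position, which is what the analysis above suggests.

```swift
// Toy model of a backtracking matcher's state. All names are hypothetical
// and do not correspond to the engine's real types.
struct MatcherState {
    var position: Int
    var captures: [Int: Range<Int>]   // capture group number -> matched range
}

struct SavePoint {
    let resumeAt: String              // label to jump to when a failure restores this point
    let state: MatcherState           // snapshot taken when the save point was pushed
}

// Matching /(?=(9))/ against "Something 9a": the "9" sits at offset 10.
var state = MatcherState(position: 10, captures: [:])
var savePoints: [SavePoint] = []

// (0) save(restoringAt: success): snapshot taken before the lookahead runs.
savePoints.append(SavePoint(resumeAt: "success", state: state))

// (2) The sub-pattern `(9)` consumes one character and records capture 1.
state.captures[1] = state.position..<(state.position + 1)
state.position += 1
let capturesInsideLookahead = state.captures

// (4) fail: pop the most recent save point and restore its whole snapshot.
let restored = savePoints.removeLast()
state = restored.state
let resumeLabel = restored.resumeAt

print(resumeLabel)                          // prints "success"
print(state.position)                       // prints "10": position rewound, as a lookahead requires
print(capturesInsideLookahead[1] != nil)    // prints "true": the capture existed inside the lookahead
print(state.captures[1] != nil)             // prints "false": the restore erased it
```

In this model the rewind of `position` is exactly what a lookahead needs, but the captures are rolled back along with it, so the match continues at `success` with capture 1 missing.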