Large regression in word boundary regexes

Question

rctcwyvrn opened this issue 2 years ago

I was running the benchmarker for this regex, #"<(\w*)\b[^>]*>(.*?)<\/\1>"#, which uses \b to match the end of an HTML tag name, and noticed it was running really slowly.
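For readers who want to see what the pattern captures, here is a minimal sketch, assuming Swift 5.7 or later and the standard library's `Regex(_:)` string initializer. The helper name `tagsAndBodies(in:)` and the sample input are made up for illustration and are not part of the benchmarker.

```swift
// Requires Swift 5.7+ (macOS 13 / iOS 16 or later on Apple platforms).
// The pattern is copied from the report; everything else is illustrative.
let htmlPattern = #"<(\w*)\b[^>]*>(.*?)<\/\1>"#

/// Hypothetical helper: returns (tag name, element body) for each match.
func tagsAndBodies(in input: String) throws -> [(tag: String, body: String)] {
    let regex = try Regex(htmlPattern)
    return input.matches(of: regex).map { match in
        // Capture 1 is the tag name; capture 2 is the lazily matched body.
        let tag = match.output[1].substring.map(String.init) ?? ""
        let body = match.output[2].substring.map(String.init) ?? ""
        return (tag: tag, body: body)
    }
}

// Made-up sample input.
let sample = #"<p class="x">hello</p><b>bold</b>"#
let found = (try? tagsAndBodies(in: sample)) ?? []
for item in found {
    print("\(item.tag): \(item.body)")
}
```

The `\b` after `(\w*)` is what forces a word-boundary check at every candidate tag name, which is why this pattern is sensitive to how expensive that check is.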

ed842cb

Running
- htmlAll 11.8ms

main

Running
- htmlAll 3.08s

Some regression was expected with the implementation of the new word-breaking algorithm, but a roughly 260x slowdown (11.8ms to 3.08s) seems unacceptable. A quick profile shows that ~99% of the time is spent in AssertFunction, with 90% of that in String._wordIndex(after:) and 10% in Set.insert.
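One way to check how much of the cost comes from the word-boundary check is to time the same pattern under both word-boundary modes. This is a rough sketch, assuming Swift 5.7 or later, where `Regex` exposes `wordBoundaryKind(_:)` with `.default` and `.simple`. The helpers `timeMatches` and `compareWordBoundaryKinds` and the input are hypothetical, and this is not how the benchmarker itself measures.

```swift
// Requires Swift 5.7+ (macOS 13 / iOS 16 or later on Apple platforms).
// Hypothetical helper: counts matches and reports the time taken.
func timeMatches(
    _ regex: Regex<AnyRegexOutput>,
    in input: String
) -> (count: Int, elapsed: Duration) {
    var count = 0
    let elapsed = ContinuousClock().measure {
        count = input.matches(of: regex).count
    }
    return (count: count, elapsed: elapsed)
}

// Hypothetical helper: runs the reported pattern with the default
// (Unicode word-breaking) \b and with the simple (\w vs. \W) \b.
func compareWordBoundaryKinds(
    on input: String
) throws -> (defaultCount: Int, simpleCount: Int) {
    let base = try Regex(#"<(\w*)\b[^>]*>(.*?)<\/\1>"#)
    let simple = base.wordBoundaryKind(.simple)

    let d = timeMatches(base, in: input)
    let s = timeMatches(simple, in: input)
    print("default \\b: \(d.count) matches in \(d.elapsed)")
    print("simple  \\b: \(s.count) matches in \(s.elapsed)")
    return (defaultCount: d.count, simpleCount: s.count)
}

// Made-up input: one small element repeated many times.
let repeated = String(repeating: #"<div id="a">text</div>"#, count: 200)
let counts = try? compareWordBoundaryKinds(on: repeated)
```

The two modes can disagree on non-ASCII text, so this comparison is a diagnostic for locating the cost, not a drop-in fix.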

cc @Azoy @milseman

Michael Ilseman · Answer 1 · Wed Jul 13 2022 01:18:39 GMT+0800 (China Standard Time)

@Azoy, is this because the SPI is inefficient? Any thoughts on what to do here?

Alejandro Alonso · Answer 2 · Wed Jul 13 2022 03:18:04 GMT+0800 (China Standard Time)

Yeah, the current implementation of String.isOnWordBoundary in this repo is really inefficient, and I was fully expecting perf to be pretty bad. Once _nearestWordIndex(atOrBelow:) is fixed, I think this operation will get considerably faster.