unicode: add UTF flag if subject is UTF
rurban opened this issue
Reini Urban commented
If the pattern is not UTF-8 (but is ambiguous, containing \D, \W, ...)
but the subject is, recompile with the UTF flag and match.
failing re_tests:
\w \x{200C} yp $& \x{200C}
\W \x{200C} np - -
\w \x{200D} yp $& \x{200D}
\W \x{200D} np - -
/^\D{11}/a \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF} np - -
/^\S{11}/a \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF} np - -
/^\W{11}/a \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF} np - -
# [ perl #114272]
\Vn \xFFn/ yp $& \xFFn
a?\X a\x{100} yp $& a\x{100}
Reini Urban commented
plan for the implementation strategy:
- If a pattern contains Unicode-sensitive classes like \w, \s, \d, always compile with /u.
  If the subject is ASCII, compile again with /a and do the ASCII match.
- Otherwise, if the pattern is compiled with /a and the subject is UTF-8, recompile with /u.
- Cache the optional second pattern in pprivate as a struct of compiled_ascii_pattern and compiled_uni_pattern, together with the engine. See e.g. re::engine::Hyperscan, where I also store two ptrs in pprivate.
- Also cache statistics about ASCII/Unicode usage to make better predictions (e.g. 2 more ints).
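
A minimal sketch of the caching idea in the plan above, not the actual re::engine::PCRE2 code. The struct name `dual_pattern_cache`, the functions `subject_is_ascii` and `select_pattern`, and the `stub_compile` stand-in are all hypothetical; only the two field names come from the plan. It also simplifies the plan by compiling each variant lazily on first use rather than always compiling /u first, and it decides "ASCII vs. UTF-8" by scanning bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout for what would hang off pprivate:
 * two compiled variants plus two usage counters. */
typedef struct {
    void *compiled_ascii_pattern; /* compiled with /a semantics, or NULL */
    void *compiled_uni_pattern;   /* compiled with /u semantics, or NULL */
    unsigned ascii_hits;          /* matches against ASCII-only subjects */
    unsigned uni_hits;            /* matches against non-ASCII subjects  */
} dual_pattern_cache;

/* Assumed compile hook: utf == true means "compile with the UTF flag". */
typedef void *(*compile_fn)(const char *pattern, bool utf);

/* Returns true if every byte of the subject is below 0x80. */
static bool subject_is_ascii(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] >= 0x80)
            return false;
    }
    return true;
}

/* Picks the variant that fits the subject, compiling it only once,
 * and records which kind of subject was seen. */
static void *select_pattern(dual_pattern_cache *c, const char *pattern,
                            const char *subject, size_t len,
                            compile_fn compile)
{
    bool ascii = subject_is_ascii(subject, len);
    void **slot = ascii ? &c->compiled_ascii_pattern
                        : &c->compiled_uni_pattern;

    if (ascii)
        c->ascii_hits++;
    else
        c->uni_hits++;

    if (*slot == NULL)
        *slot = compile(pattern, !ascii);
    return *slot;
}

/* Stand-in compiler for illustration only: returns a distinct marker
 * per variant and counts how often it was called. */
static int ascii_marker, uni_marker;
static int compile_calls;

static void *stub_compile(const char *pattern, bool utf)
{
    (void)pattern;
    compile_calls++;
    return utf ? (void *)&uni_marker : (void *)&ascii_marker;
}
```

In a real engine the compile hook would presumably wrap pcre2_compile with or without PCRE2_UTF, and the ASCII/UTF-8 decision would more likely come from the subject's UTF-8 flag (SvUTF8) than from a byte scan. The counters are only collected here; how to use them for prediction is left open, as in the plan.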
Todd Richmond commented
Any progress on this? I'm looking to use PCRE2 for better perf, but need mixed UTF-8 (regex and subject) to work.