unicode: add UTF flag if subject is UTF

rurban opened this issue 7 years ago

If the pattern is not UTF-8 (but is ambiguous because it uses \D, \W, ...)
and the subject is, recompile with the UTF flag and match.
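
The decision described above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: `subject_has_non_ascii` and `needs_utf_recompile` are hypothetical names, not functions from the module, and skipping the recompile for a pure-ASCII subject is an assumption on top of the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if any byte has the high bit set,
 * i.e. the subject is not pure ASCII. */
bool subject_has_non_ascii(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return true;
    }
    return false;
}

/* Hypothetical helper: recompile with the UTF flag only when the pattern
 * was compiled without it, the subject is flagged as UTF-8, and the
 * subject really contains non-ASCII bytes (the last condition is an
 * assumption, not something the issue states). */
bool needs_utf_recompile(bool pattern_is_utf, bool subject_is_utf,
                         const unsigned char *subject, size_t len)
{
    if (pattern_is_utf || !subject_is_utf)
        return false;
    return subject_has_non_ascii(subject, len);
}
```

With a subject holding the UTF-8 bytes of \x{200C} and a pattern compiled without UTF, `needs_utf_recompile` reports that a second compile is needed; for an ASCII-only subject it does not.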

failing re_tests:

\w	\x{200C}	yp	$&	\x{200C}
\W	\x{200C}	np	-	-
\w	\x{200D}	yp	$&	\x{200D}
\W	\x{200D}	np	-	-

/^\D{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\S{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\W{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-

# [ perl #114272]
\Vn	\xFFn/	yp	$&	\xFFn

a?\X	     a\x{100}	yp	$&	a\x{100}

Reini Urban · Answer 1 · Sun Apr 09 2017 17:38:02 GMT+0800 (China Standard Time)

Plan for the implementation strategy:

If a pattern contains Unicode classes like \w, \s, \d, always compile it with /u.
If the subject is ASCII, compile the pattern again with /a and do the ASCII match.
Otherwise, if the pattern was compiled with /a and the subject is UTF-8, recompile it with /u.
Cache the optional second pattern in pprivate, as a struct holding compiled_ascii_pattern and compiled_uni_pattern together with the engine. See e.g. re::engine::Hyperscan, where I also store two pointers in pprivate.
Also cache statistics about ASCII/Unicode usage to make better predictions (e.g. two more ints).
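
A minimal C sketch of the cached pair and the usage counters, assuming nothing about the real module beyond the two field names mentioned above. `dual_pattern`, `dual_pattern_select`, `dual_pattern_prefers_utf`, and the `compile_fn` callback are hypothetical names, and `fake_compile` is a stand-in for the real regex compiler. Unlike the plan, the sketch compiles both variants lazily, on first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compile callback: returns an opaque compiled pattern. */
typedef void *(*compile_fn)(const char *pattern, bool utf);

/* Hypothetical struct that could hang off pprivate. */
typedef struct {
    const char *pattern;
    void *compiled_ascii_pattern;  /* compiled without UTF, or NULL */
    void *compiled_uni_pattern;    /* compiled with UTF, or NULL */
    unsigned long ascii_uses;      /* statistics for later predictions */
    unsigned long uni_uses;
} dual_pattern;

void dual_pattern_init(dual_pattern *dp, const char *pattern)
{
    assert(dp != NULL && pattern != NULL);
    dp->pattern = pattern;
    dp->compiled_ascii_pattern = NULL;
    dp->compiled_uni_pattern = NULL;
    dp->ascii_uses = 0;
    dp->uni_uses = 0;
}

/* Return the variant matching the subject, compiling it on first use. */
void *dual_pattern_select(dual_pattern *dp, bool subject_is_utf,
                          compile_fn compile)
{
    assert(dp != NULL && compile != NULL);
    void **slot = subject_is_utf ? &dp->compiled_uni_pattern
                                 : &dp->compiled_ascii_pattern;
    if (*slot == NULL)
        *slot = compile(dp->pattern, subject_is_utf);
    if (*slot != NULL) {
        if (subject_is_utf)
            dp->uni_uses++;
        else
            dp->ascii_uses++;
    }
    return *slot;
}

/* Simple prediction: which variant has been used more often so far. */
bool dual_pattern_prefers_utf(const dual_pattern *dp)
{
    assert(dp != NULL);
    return dp->uni_uses > dp->ascii_uses;
}

/* Stand-in compiler for demonstration only; a real engine would call
 * its own compile function here. Counts how often it is invoked. */
int fake_compile_calls = 0;

void *fake_compile(const char *pattern, bool utf)
{
    static int ascii_token, uni_token;
    (void)pattern;
    fake_compile_calls++;
    return utf ? (void *)&uni_token : (void *)&ascii_token;
}
```

Each variant is compiled at most once, so repeated matches against subjects of the same kind reuse the cached pointer, and the two counters give a cheap hint about which variant to compile first next time.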

Todd Richmond · Answer 2 · Thu Sep 28 2023 04:01:02 GMT+0800 (China Standard Time)

Any progress on this? I'm looking to use PCRE2 for better perf, but need mixed UTF-8 (regex and subject) to work.