rurban / re-engine-PCRE2

use pcre-jit instead of slow perl regex

unicode: add UTF flag if subject is UTF

rurban opened this issue

If the pattern is not UTF-8 (but is ambiguous because it uses classes such as \D\W...)
and the subject is, recompile with the UTF flag and match.

failing re_tests:

\w	\x{200C}	yp	$&	\x{200C}
\W	\x{200C}	np	-	-
\w	\x{200D}	yp	$&	\x{200D}
\W	\x{200D}	np	-	-

/^\D{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\S{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\W{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-

# [ perl #114272]
\Vn	\xFFn/	yp	$&	\xFFn

a?\X	     a\x{100}	yp	$&	a\x{100}

Implementation plan:

  • If a pattern contains Unicode-sensitive classes like \w, \s, \d, always compile with /u.
    If the subject is ASCII, compile again with /a and do the ASCII match.
  • Otherwise, if the pattern was compiled with /a and the subject is UTF-8, recompile with /u.
  • Cache the optional second pattern in pprivate, as a struct holding compiled_ascii_pattern and compiled_uni_pattern together with the engine. See e.g. re::engine::Hyperscan, where I also store two pointers in pprivate.
  • Also cache statistics about ASCII/Unicode usage to make better predictions (e.g. two more ints).
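
A rough sketch of the caching idea from the plan above, in plain C and without any PCRE2 or Perl API calls. Only the field names compiled_ascii_pattern and compiled_uni_pattern come from the plan; `dual_pattern`, `compiled_t`, `compile_fn`, `select_pattern`, `demo_compile` and the two counters are hypothetical names invented for illustration. It also simplifies the first bullet by compiling each variant lazily on first use, instead of always compiling with /u first. In the real engine the handles would be compiled PCRE2 patterns and the struct would hang off pprivate.

```c
#include <stddef.h>

/* Hypothetical stand-in for a compiled pattern handle. */
typedef void *compiled_t;

/* Hypothetical compile callback: utf != 0 requests a UTF (/u) compile,
 * utf == 0 requests an ASCII (/a) compile. */
typedef compiled_t (*compile_fn)(const char *pattern, int utf);

/* What the plan proposes to keep in pprivate: both compiled variants
 * plus two usage counters ("2 more ints"). */
typedef struct {
    const char *pattern;
    compiled_t  compiled_ascii_pattern; /* NULL until first ASCII subject */
    compiled_t  compiled_uni_pattern;   /* NULL until first UTF-8 subject */
    unsigned    ascii_uses;             /* matches against ASCII subjects */
    unsigned    uni_uses;               /* matches against UTF-8 subjects */
} dual_pattern;

/* Return the variant that fits the subject, compiling it at most once.
 * subject_utf8 is assumed to be supplied by the caller. */
static compiled_t select_pattern(dual_pattern *dp, int subject_utf8,
                                 compile_fn compile)
{
    if (subject_utf8) {
        dp->uni_uses++;
        if (dp->compiled_uni_pattern == NULL)
            dp->compiled_uni_pattern = compile(dp->pattern, 1);
        return dp->compiled_uni_pattern;
    }
    dp->ascii_uses++;
    if (dp->compiled_ascii_pattern == NULL)
        dp->compiled_ascii_pattern = compile(dp->pattern, 0);
    return dp->compiled_ascii_pattern;
}

/* Demo compiler: counts calls and returns a distinct dummy handle per
 * mode, so the caching behaviour can be observed without a regex library. */
static int compile_calls = 0;

static compiled_t demo_compile(const char *pattern, int utf)
{
    static int ascii_token, uni_token;
    (void)pattern;
    compile_calls++;
    return utf ? (compiled_t)&uni_token : (compiled_t)&ascii_token;
}
```

Calling `select_pattern` repeatedly with the same `subject_utf8` value reuses the cached handle, so a pattern that sees both kinds of subject is compiled at most twice. The counters are only collected here; how they would feed into a prediction is left open, as in the plan.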

Any progress on this? I'm looking to use PCRE2 for better performance, but I need mixed UTF-8 (regex and subject) to work.