rurban / re-engine-PCRE2

use pcre-jit instead of slow perl regex

UTF8 capture problem

todd-richmond opened this issue · comments

There is a common re::engine::* bug where the RXf_MATCH_UTF8 flag is not set on the perl regex object, so captures are not computed as UTF8 when the input is UTF8. Setting the flag fixes 2 critical issues:

  1.  All captures, as well as ${^PREMATCH} and ${^POSTMATCH}, will correctly have their utf8 bits set
    
  2.  $-[0] and $+[0] (start and end offsets of the match) will be computed as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to extract the matched substring from the original text by offset, and that approach is required in place of ${^POSTMATCH}, which has a horrific perf problem
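
To illustrate why byte offsets and character offsets diverge on UTF8 input, here is a minimal C sketch. It is not taken from this module or from perl; the helper name `utf8_byte_to_char_offset` is hypothetical, and it assumes well-formed UTF-8 with the byte offset falling on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only (not part of perl or PCRE2).
 * Returns the number of UTF-8 characters contained in the first byte_off
 * bytes of buf. Assumes buf is well-formed UTF-8 and that byte_off lies on
 * a character boundary. */
size_t utf8_byte_to_char_offset(const unsigned char *buf, size_t len,
                                size_t byte_off)
{
    size_t chars = 0;
    size_t i;

    assert(byte_off <= len);
    for (i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
         * starts a new character, so count only those. */
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the 12-byte subject `"na\xC3\xAFve caf\xC3\xA9"` ("naïve café"), a match on the last word starts at byte 7 but at character 6, and ends at byte 12 but at character 10. An engine that reports the byte values as if they were character offsets would make a character-based substring call land in the wrong place.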
    

XS code will need to do something like this:

    #ifdef RXf_UTF8
        if (flags & RXf_UTF8)
            extflags |= RXf_MATCH_UTF8;
    #else
        if (SvUTF8(pattern))
            extflags |= RXf_MATCH_UTF8;
    #endif

@todd-richmond did you file a ticket with perl for this? This is the first time I've heard about it.

Also the engine does not get the utf8 flag set yet. #15

@demerphq I think this is an issue with this module, not perl, but I'm not 100% positive.
I filed a similar patch against re-engine-re2 and have been using that in production for several years. Without it, all captures are corrupt if the pattern is UTF8.