UTF8 capture problem
todd-richmond opened this issue · comments
There is a common re::engine::* bug: the RXf_MATCH_UTF8 flag is not set on the perl regex object, so captures are not computed as UTF8 when the input is UTF8. Setting the flag fixes 2 critical issues:
- All captures, as well as ${^PREMATCH} and ${^POSTMATCH}, will correctly have their utf8 bits set.
- $+[0] and $-[0] (offsets of captures) will be computed correctly as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to compute the substring for a match from the original text, and that approach is required because using ${^POSTMATCH} instead has a horrific perf problem.
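To make the second point concrete, here is a minimal standalone C sketch of the gap between a byte offset and a character offset in a UTF8 string. It is not perl's code and does not use the perl API; the function name `utf8_byte_to_char_offset` is made up for illustration, and it assumes the input is valid UTF8 with the offset on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of perl): count the UTF8 characters that
 * occupy the first byte_off bytes of s.
 * Assumes s is valid UTF8 and byte_off falls on a character boundary. */
size_t utf8_byte_to_char_offset(const char *s, size_t len, size_t byte_off)
{
    size_t chars = 0;
    size_t i;

    assert(byte_off <= len);
    for (i = 0; i < byte_off; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the 13-byte string "h\xC3\xA9llo w\xC3\xB6rld" (11 characters), a match on the second word starts at byte 7 but at character 6. An engine that reports the byte value where perl expects the character value yields offsets that point at the wrong place as soon as a multi-byte character precedes the match.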
XS code will need to do something like this:
    #ifdef RXf_UTF8
        if (flags & RXf_UTF8)
            extflags |= RXf_MATCH_UTF8;
    #else
        if (SvUTF8(pattern))
            extflags |= RXf_MATCH_UTF8;
    #endif
@todd-richmond did you file a ticket with perl for this? This is the first time I've heard about it.
Also the engine does not get the utf8 flag set yet. #15
@demerphq I think this is an issue with this module, not perl, but not 100% positive
I filed a similar patch against re-engine-re2 and have been running it in production for several years. Without it, all captures are corrupt if the pattern is UTF8.