Inconsistency with * or ? followed by \x{ffffffff} when caseless,ucp
addisoncrump opened this issue
I confirmed this is unrelated to the changes applied in #350 -- it has been around since at least 10.40, but I did not check earlier versions.
Consider the following regexes:
$ ./pcre2test -jit -32
PCRE2 version 10.43-DEV 2023-04-14 (32-bit)
re> /a*\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
0: \x{ffffffff}
data>
re> /k*\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
No match
It seems that the case-folded k has different behaviour to the case-folded a. Note that using e.g. b or \x{ff000000} in place of \x{ffffffff} does not exhibit the same behaviour:
re> /k*\x{ff000000}/caseless,ucp
data> \x{ff000000}
0: \x{ff000000}
data> \x{ff000000}\=no_jit
0: \x{ff000000}
Nor does the issue appear when \x{ffffffff} stands alone, with no optional letter before it:
re> /\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
0: \x{ffffffff}
The issue also appears with 0-or-1 repetitions:
re> /k?\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
No match
In their wisdom, the Unicode people have decreed that the ASCII letters k and s have more than one other case. This has caused no end of confusion and is a good source of bugs. This is yet another one, and the fixes (there is more than one) are similar to #350: that is, do different things for characters beyond the Unicode range. However, I have now run out of time. Next week...
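For anyone following along, the extra case variants of k and s mentioned above can be listed with a short Python sketch. This is only an illustration, and it assumes Python's own `str.casefold()` tables, not PCRE2's internal ones. The helper name `case_variants` is hypothetical. It also shows that 0xffffffff lies far above the largest Unicode code point, which is why it needs separate handling in 32-bit mode.

```python
import sys


def case_variants(ch):
    """Return every code point that case-folds to the same string as ch.

    Illustrative helper (hypothetical name); uses Python's Unicode tables,
    which are an assumption here and are not PCRE2's own tables.
    """
    target = ch.casefold()
    return sorted(
        c for c in map(chr, range(sys.maxunicode + 1)) if c.casefold() == target
    )


for letter in "aks":
    variants = case_variants(letter)
    # Show each variant as U+XXXX so the non-ASCII ones are easy to spot.
    print(letter, [f"U+{ord(c):04X}" for c in variants])

# 'k' also has U+212A (KELVIN SIGN) and 's' also has U+017F (LONG S),
# so each has two "other cases"; 'a' has only its uppercase form.

# The largest valid Unicode code point is U+10FFFF; 0xffffffff is far beyond it.
print(hex(sys.maxunicode), 0xFFFFFFFF > sys.maxunicode)
```

The letters b and a each have a single other case, which fits the observation that only k (and, by the same reasoning, s) triggers the mismatch.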
This is now fixed in ad73148.