PCRE2Project / pcre2

PCRE2 development is now based here.

Inconsistency with * or ? followed by \x{ffffffff} when caseless,ucp

addisoncrump opened this issue

I confirmed that this is unrelated to the changes applied in #350. It has been present since at least 10.40; I did not check earlier versions.

Consider the following regexes:

$ ./pcre2test -jit -32
PCRE2 version 10.43-DEV 2023-04-14 (32-bit)
  re> /a*\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
 0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
 0: \x{ffffffff}
data> 
  re> /k*\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
 0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
No match

It seems that the case-folded k behaves differently from the case-folded a: with k, the JIT reports a match but the interpreter (no_jit) does not. Note that using e.g. b or \x{ff000000} in place of \x{ffffffff} does not exhibit the same behaviour:

  re> /k*\x{ff000000}/caseless,ucp
data> \x{ff000000}
 0: \x{ff000000}
data> \x{ff000000}\=no_jit
 0: \x{ff000000}

Nor does the issue appear when \x{ffffffff} is matched on its own:

  re> /\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
 0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
 0: \x{ffffffff}

The issue also appears with 0-or-1 repetitions:

  re> /k?\x{ffffffff}/caseless,ucp
data> \x{ffffffff}
 0: \x{ffffffff}
data> \x{ffffffff}\=no_jit
No match

In their wisdom, the Unicode people have decreed that the ASCII letters k and s have more than one other case. This has caused no end of confusion and is a good source of bugs. This is yet another such bug, and the fixes (there is more than one) are similar to #350: that is, do different things for characters beyond the Unicode range (greater than 0x10ffff). However, I have now run out of time. Next week...
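
For readers who have not met the "more than one other case" problem, here is a minimal Python sketch of it. It is not PCRE2's code: the function name `caseless_set` and the constant `MAX_UNICODE` are made up for illustration, and it leans on Python's `str.casefold()` rather than PCRE2's own Unicode tables. It only shows that k and s each have a third caseless partner (U+212A KELVIN SIGN and U+017F LATIN SMALL LETTER LONG S), whereas a does not, and that a value such as 0xffffffff lies beyond the Unicode range and so needs a separate path.

```python
MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point


def caseless_set(cp):
    """Return every code point that case-folds to the same string as cp.

    Hypothetical helper for illustration only. Values above MAX_UNICODE
    (possible in a 32-bit, non-UTF subject) carry no case information,
    so they are treated as matching only themselves.
    """
    if cp > MAX_UNICODE:
        return {cp}
    folded = chr(cp).casefold()
    # Brute-force scan of the whole Unicode range; slow but simple.
    return {c for c in range(MAX_UNICODE + 1) if chr(c).casefold() == folded}


if __name__ == "__main__":
    for value in (ord("a"), ord("k"), ord("s"), 0xFFFFFFFF):
        members = ", ".join(f"{c:#x}" for c in sorted(caseless_set(value)))
        print(f"{value:#x}: {members}")
```

The guard at the top of `caseless_set` mirrors the idea described above of doing something different for characters greater than 0x10ffff; how PCRE2 actually implements that in the interpreter and the JIT is not shown here.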

This is now fixed in ad73148.