firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

False positive match for certain unicode characters

kkmuffme opened this issue

Bug Description

Reproduction steps

PHP >= 7.3

put regex:

a[–]b

for text:

a–b

Expected Outcome

No match.
It should match only when the "u" flag is selected.

See php/php-src#14306

Browser

Chrome 124

OS

Win 11

Not a bug, see my reply in the php/php-src issue.

I think you misunderstood the issue. It's NOT a bug in PHP, which is why I had already closed the php-src issue before your reply.

But it is a bug in regex101, because regex101 shows a match even WITHOUT the "u" flag (while PHP does not). This difference in behavior is the bug.

Ah, I see what you mean...
I'm guessing perhaps the WASM version implicitly supports non-ASCII strings? But I'm not sure what flavor library is involved here, or if it's a custom build specifically for the site.

Sorry @working-name, does seem there is a problem here after all 😓

Could this be because the website uses UTF-16 while PHP uses UTF-8?

Possibly.
I guess the solution is similar to what already happens when you use a[💩]b: when the "u" flag is not set I see 2 ? boxes, and when I select "u" it shows the emoji correctly.
This is a 4-byte emoji, while – is 3 bytes.

Please reopen the issue.

Just tested, and this bug exists for ALL multi-byte characters, even basic ones.
e.g. /a[ä]b/ for the text aäb: regex101 shows a match, PHP does not.

REASON WHY:
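PHP's preg functions match byte by byte unless the "u" flag is provided.
In UTF-8, ä is 2 bytes, – is 3 bytes and 💩 is 4 bytes. In UTF-16, ä and – are each a single 16-bit code unit, while 💩 needs two (a surrogate pair).

Since 💩 does not fit in one UTF-16 code unit, it behaves correctly (no match without "u"). But – and ä each fit in one code unit, so an engine working on UTF-16 code units sees them as single characters, which leads to these false positives.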
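
To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It uses Python's `re` module only as a stand-in: a bytes pattern approximates PCRE without the "u" flag, and a str pattern approximates a Unicode-aware engine. It is not PHP's PCRE or regex101's actual engine, and the helper names are made up for illustration.

```python
import re

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the character occupies in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

def matches_bytewise(pattern: str, text: str) -> bool:
    """Match on raw UTF-8 bytes (assumed analogue of PHP without "u")."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None

def matches_charwise(pattern: str, text: str) -> bool:
    """Match on whole code points (assumed analogue of PHP with "u")."""
    return re.search(pattern, text) is not None

for ch in ("ä", "–", "💩"):
    print(repr(ch), "UTF-8 bytes:", utf8_len(ch), "UTF-16 units:", utf16_units(ch))

# Byte-wise, [–] is a class of three separate bytes and matches only one of
# them, so the "b" that must follow never lines up and nothing matches.
print(matches_bytewise("a[–]b", "a–b"))  # False
print(matches_charwise("a[–]b", "a–b"))  # True
```

In byte mode the character class consumes a single byte of the multi-byte sequence, which is why the pattern cannot match until the engine is told to treat the input as UTF-8.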

From the PHP manual, on the "u" modifier: "Pattern and subject strings are treated as UTF-8."

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Possible solution: without the "u" flag, treat each UTF-8 byte of the pattern and the text as one single-byte (ISO-8859-1) character, and with the "u" flag treat both as UTF-8?
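
One way to read this proposal, sketched in Python under assumptions: when the "u" flag is off, each UTF-8 byte of the pattern and the text is reinterpreted as one ISO-8859-1 character before matching, so a character-based engine sees the same units a byte-based engine would. The names `as_byte_chars` and `emulate_php_match` are hypothetical, and nothing here reflects how regex101 is actually implemented.

```python
import re

def as_byte_chars(s: str) -> str:
    """Reinterpret the UTF-8 bytes of s as Latin-1, one character per byte."""
    return s.encode("utf-8").decode("latin-1")

def emulate_php_match(pattern: str, text: str, u_flag: bool) -> bool:
    """Hypothetical emulation: byte semantics without "u", UTF-8 with it."""
    if not u_flag:
        # Each multi-byte character becomes several one-byte characters.
        pattern, text = as_byte_chars(pattern), as_byte_chars(text)
    return re.search(pattern, text) is not None

for pat, txt in (("a[–]b", "a–b"), ("a[ä]b", "aäb"), ("a[💩]b", "a💩b")):
    print(pat, emulate_php_match(pat, txt, u_flag=False),
          emulate_php_match(pat, txt, u_flag=True))
```

Note the direction of the conversion: the text is not transcoded into ISO-8859-1 (which would turn ä into a single byte and still match), its UTF-8 bytes are merely viewed as single-byte characters.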