grapheme_strlen shows different length of emoji ZWJ Sequence when compared to native
Luc45 opened this issue
Take the following emoji for instance: 👩‍👩‍👦‍👦
This emoji consists of four different emojis glued together by Zero Width Joiner characters, as seen on https://emojipedia.org/family-woman-woman-boy-boy/.
When checking the length with the native grapheme_strlen(), it returns 1, while this library returns 4.
This is possibly due to a bug in the GRAPHEME_CLUSTER_RX regex.
This bug should only happen on PCRE_VERSION < 8.32; however, when combined with bug #369, it applies to every PCRE_VERSION that contains a date timestamp, which seems to be the default format.
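To make the discrepancy concrete, here is a minimal sketch that builds the sequence from explicit code points and counts it in several ways. The variable names are illustrative. The count reported by \X depends on the PCRE version PHP was built against, so no expected value is given for it.

```php
<?php
// Build the family emoji from explicit code points: four emoji joined by
// three ZERO WIDTH JOINER (U+200D) characters.
$family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

// Bytes in the UTF-8 encoding: 4 emoji * 4 bytes + 3 joiners * 3 bytes.
$bytes = strlen($family);

// Code points: "." with the /u and /s modifiers matches one code point at a time.
$codePoints = preg_match_all('/./su', $family);

// Grapheme clusters as seen by PCRE's \X; the result varies with the PCRE version.
$clusters = preg_match_all('/\X/u', $family);

printf("bytes=%d codepoints=%d clusters(\\X)=%d\n", $bytes, $codePoints, $clusters);

// The native function is only available when ext-intl is loaded.
if (extension_loaded('intl')) {
    printf("intl grapheme_strlen=%d\n", grapheme_strlen($family));
}
```

The byte and code point counts are fixed by the encoding (25 and 7); only the cluster count is implementation-dependent.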
Therefore, the grapheme_strlen function in this polyfill is likely to provide incorrect results, such as in this example:
Expected result of grapheme_strlen('👩‍👩‍👦‍👦'):
The test is being conducted using the regex: \X
int(1)
int(1)
int(1)
int(1)
Actual result with the custom cluster regex for grapheme_strlen('👩‍👩‍👦‍👦'):
The test is being conducted using the regex: (?:\r\n|(?:[ -~\x{200C}\x{200D}]|[ᆨ-ᇹ]+|[ᄀ-ᅟ]*(?:[가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기까깨꺄꺠꺼께껴꼐꼬꽈꽤꾀꾜꾸꿔꿰뀌뀨끄끠끼나내냐냬너네녀녜노놔놰뇌뇨누눠눼뉘뉴느늬니다대댜댸더데뎌뎨도돠돼되됴두둬뒈뒤듀드듸디따때땨떄떠떼뗘뗴또똬뙈뙤뚀뚜뚸뛔뛰뜌뜨띄띠라래랴럐러레려례로롸뢔뢰료루뤄뤠뤼류르릐리마매먀먜머메며몌모뫄뫠뫼묘무뭐뭬뮈뮤므믜미바배뱌뱨버베벼볘보봐봬뵈뵤부붜붸뷔뷰브븨비빠빼뺘뺴뻐뻬뼈뼤뽀뽜뽸뾔뾰뿌뿨쀄쀠쀼쁘쁴삐사새샤섀서세셔셰소솨쇄쇠쇼수숴쉐쉬슈스싀시싸쌔쌰썌써쎄쎠쎼쏘쏴쐐쐬쑈쑤쒀쒜쒸쓔쓰씌씨아애야얘어에여예오와왜외요우워웨위유으의이자재쟈쟤저제져졔조좌좨죄죠주줘줴쥐쥬즈즤지짜째쨔쨰쩌쩨쪄쪠쪼쫘쫴쬐쬬쭈쭤쮀쮜쮸쯔쯰찌차채챠챼처체쳐쳬초촤쵀최쵸추춰췌취츄츠츼치카캐캬컈커케켜켸코콰쾌쾨쿄쿠쿼퀘퀴큐크킈키타태탸턔터테텨톄토톼퇘퇴툐투퉈퉤튀튜트틔티파패퍄퍠퍼페펴폐포퐈퐤푀표푸풔풰퓌퓨프픠피하해햐햬허헤혀혜호화홰회효후훠훼휘휴흐희히]?[ᅠ-ᆢ]+|[가-힣])[ᆨ-ᇹ]*|[ᄀ-ᅟ]+|[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}])[\p{Mn}\p{Me}\x{09BE}\x{09D7}\x{0B3E}\x{0B57}\x{0BBE}\x{0BD7}\x{0CC2}\x{0CD5}\x{0CD6}\x{0D3E}\x{0D57}\x{0DCF}\x{0DDF}\x{200C}\x{200D}\x{1D165}\x{1D16E}-\x{1D172}]*|[\p{Cc}\p{Cf}\p{Zl}\p{Zp}])
int(1)
int(4)
int(1)
int(4)
I forgot to share the code snippet used on the results above: https://3v4l.org/OPBFq#v8.0.10
Would you agree that this issue can be closed once #369 is merged? In other words, we wouldn't provide the most recent regexp to people who use older PCRE versions.
Alternatively, would you mind looking at improving this regexp? I'm sure I generated it, but I don't remember how. There might be a script somewhere in this repo or maybe in https://github.com/tchwork/utf8
Thanks for asking for my input.
This package requires PHP 7.1, which seems to use PCRE 8.38 according to 3v4l.org: https://3v4l.org/S1bPl
On the PHP versions made available by 3v4l, 8.32 is used on PHP versions below 5.5.9, but I'm not sure if this will always be the case.
Is it possible for PHP 7.1+ to be running PCRE 8.32?
It seems PCRE 8.32 made its way into PHP core in 2013: php/php-src@357ab3c
It was then replaced with 8.35 in 2014: php/php-src@dd0e96c
I guess it's fine to drop support for the old PCRE_VERSION. It would be ideal if this could be enforced in composer.json through ext-pcre, but given the non-standard version number of PCRE, enforcing a version constraint can be challenging.
https://jubianchi.github.io/semver-check/#/^10%20||%20^8.34/8.34%202013-12-15
Or "ext-pcre": "> 8.32":
https://jubianchi.github.io/semver-check/#/%3E%208.32/8.34%202013-12-15
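Since PCRE_VERSION carries a release date after the number (for example "8.34 2013-12-15", as in the links above), one option is to compare only the numeric part at runtime. The helper name pcre_numeric_version is hypothetical and not part of this package; this is a sketch under that assumption, not the fix applied in #369.

```php
<?php
/**
 * Return the numeric part of a PCRE version string such as "8.34 2013-12-15".
 * Hypothetical helper for illustration only.
 */
function pcre_numeric_version(string $version): string
{
    // Keep everything before the first run of whitespace.
    return preg_split('/\s+/', trim($version), 2)[0];
}

// Compare the running PCRE against the threshold discussed in this thread.
$current = pcre_numeric_version(PCRE_VERSION);
$isNewerThan832 = version_compare($current, '8.32', '>');

var_dump($current, $isNewerThan832);
```

Stripping the date first keeps version_compare() working on a plain dotted number, so the result does not depend on how the date suffix would be interpreted.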
Actually, only PCRE2 (10+) is able to handle the initial grapheme_strlen example correctly: https://3v4l.org/grqP9
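An alternative to version sniffing is to probe the behaviour directly. The sketch below checks whether the running PCRE treats the ZWJ sequence from this report as a single \X match; the function name pcre_handles_zwj_sequences is made up for illustration and does not exist in the polyfill.

```php
<?php
/**
 * Report whether PCRE's \X matches an emoji ZWJ sequence as one cluster.
 * Hypothetical feature probe, not part of the polyfill.
 */
function pcre_handles_zwj_sequences(): bool
{
    // Family emoji: woman, ZWJ, woman, ZWJ, boy, ZWJ, boy.
    $family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

    // preg_match_all() returns the number of full matches (false on error).
    return preg_match_all('/\X/u', $family) === 1;
}

var_dump(pcre_handles_zwj_sequences());
```

A probe like this answers the question that matters (does \X cope with ZWJ sequences here?) without relying on the format of PCRE_VERSION.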
I'm going to close here because nobody worked on this. Ppl should upgrade to PCRE 10+ (or contribute a fix here ;) )