mb_strlen returns 0 instead of 1 for the char chr(254)

Question

mb_strlen returns 0 instead of 1 for the char chr(254)

alexchuin opened this issue 2 years ago · comments

Hi,

With the polyfill, var_dump(mb_strlen(chr(254))) return 0.
With the php8.0-mbstring extension, var_dump(mb_strlen(chr(254))) return 1;

versions : * v1.26.0

Thanks,
Alex

yannouche34490 · Answer 1 · Fri Jan 27 2023 03:42:47 GMT+0800 (China Standard Time)

Hi,
I’ve made some tests and iconv_strlen, who’s used by mb_strlen, fails for every chr() from 128 to 255. This plus function’s return type int probably casting false value returned by iconv_strlen to 0 explains thé issue.

I’ve found a trick by using utf8_encode, but it will deprecated with php 8.2.

Christophe Coevoet · Answer 2 · Fri Jan 27 2023 18:10:06 GMT+0800 (China Standard Time)

utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does)

Christophe Coevoet · Answer 3 · Fri Jan 27 2023 18:13:56 GMT+0800 (China Standard Time)

@alexchuin the issue is that chr(254) is not a valid UTF-8 string. So trying to compute its length in the UTF-8 encoding does not make any sense.
Could it be that when you try that with the actual mbstring, you have a php.ini when the default encoding used by mbstring is not UTF-8 but something else ?

yannouche34490 · Answer 4 · Fri Jan 27 2023 19:36:44 GMT+0800 (China Standard Time)

utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does)

Yes, I know, that’s why I didn’t wrote a PR. My point is maybe somerhing from ut8_encode source code is « portable » in this context to handle edge-cases like this.

Christophe Coevoet · Answer 5 · Fri Jan 27 2023 19:39:34 GMT+0800 (China Standard Time)

@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.

yannouche34490 · Answer 6 · Fri Jan 27 2023 19:44:29 GMT+0800 (China Standard Time)

@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.

Yes, we have to find a proper way to detect these cases.

Christophe Coevoet · Answer 7 · Fri Jan 27 2023 19:58:58 GMT+0800 (China Standard Time)

@yannouche34490 utf8_encode is not the tool for that either. This function is really convert_latin1_to_utf8 (which is precisely why it is deprecated).
And if you really want to work with latin1, then use strlen($string) (as latin1 is not a multibyte encoding) or mb_strlen($string, 'ISO-8859-1').

The issue here is that it looks like mb_strlen does not validate that the string is valid in the encoding it uses when computing the length, while iconv_strlen does. So passing an invalid UTF-8 string produces a different behavior in the polyfill than in the extension. we would have to implement a fallback for cases where iconv fails to have an actual polyfill.