mb_strlen returns 0 instead of 1 for the char chr(254)
alexchuin opened this issue · comments
Hi,
With the polyfill, var_dump(mb_strlen(chr(254))) returns 0.
With the php8.0-mbstring extension, var_dump(mb_strlen(chr(254))) returns 1.
Version: v1.26.0
Thanks,
Alex
Hi,
I’ve run some tests: iconv_strlen, which mb_strlen uses, fails for every chr() from 128 to 255. Combined with the function’s int return type, which probably casts the false returned by iconv_strlen to 0, this explains the issue.
I’ve found a workaround using utf8_encode, but that function will be deprecated in PHP 8.2.
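The failure described above can be reproduced with a few lines. This is a minimal sketch, assuming the iconv extension is loaded and that the underlying iconv implementation rejects the byte (as reported in this thread); the variable names are only illustrative.

```php
<?php
// A lone 0xFE byte can never appear in well-formed UTF-8.
$byte = chr(254);

// iconv_strlen() signals invalid input by returning false and raising a
// notice; the @ operator only silences that notice.
$raw = @iconv_strlen($byte, 'UTF-8');

// Forcing the result into an int, as an int-typed wrapper would, turns
// false into 0.
$asInt = (int) $raw;

var_dump($raw, $asInt);
```

The cast is what hides the failure: a caller of the wrapper cannot tell an empty string apart from an invalid one.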
utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does).
@alexchuin the issue is that chr(254) is not a valid UTF-8 string, so trying to compute its length in the UTF-8 encoding does not make any sense.
Could it be that when you try this with the actual mbstring extension, your php.ini sets the default encoding used by mbstring to something other than UTF-8?
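One way to check which encoding is in effect is to print it. A small sketch: the helper name effective_mbstring_encoding is hypothetical, and falling back to the default_charset ini setting when mbstring is not loaded is an assumption made for illustration.

```php
<?php
// Hypothetical helper: reports the encoding mb_* functions use when none is
// passed explicitly.
function effective_mbstring_encoding(): string
{
    if (function_exists('mb_internal_encoding')) {
        // Called without arguments, mb_internal_encoding() returns the
        // current internal encoding.
        return (string) mb_internal_encoding();
    }

    // Assumption: without mbstring, default_charset is the closest setting.
    return (string) ini_get('default_charset');
}

var_dump(effective_mbstring_encoding());
```

If this prints something other than UTF-8, mb_strlen(chr(254)) is measuring the string in that encoding instead.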
utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does)
Yes, I know, that’s why I didn’t write a PR. My point is that maybe something from the utf8_encode source code is "portable" in this context to handle edge cases like this.
@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.
@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.
Yes, we have to find a proper way to detect these cases.
@yannouche34490 utf8_encode is not the tool for that either. This function is really convert_latin1_to_utf8 (which is precisely why it is deprecated).
And if you really want to work with latin1, then use strlen($string) (as latin1 is not a multibyte encoding) or mb_strlen($string, 'ISO-8859-1').
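To make the latin1 point concrete, here is a short sketch, assuming the iconv extension is available. iconv() stands in for utf8_encode to show that the conversion yields a different, two-byte string; the mb_strlen call is guarded because mbstring (or a polyfill) may not be present.

```php
<?php
// In ISO-8859-1, byte 0xFE is the single character "þ".
$latin1 = chr(254);

// Latin1 is a single-byte encoding, so the byte count is the length.
$byteLength = strlen($latin1);

// Same answer from mb_strlen when the encoding is stated explicitly.
$mbLength = function_exists('mb_strlen') ? mb_strlen($latin1, 'ISO-8859-1') : null;

// Converting latin1 to UTF-8 produces a different string: the two bytes
// 0xC3 0xBE. Its length is the length of that new string, not the original.
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump($byteLength, $mbLength, bin2hex($converted));
```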
The issue here is that it looks like mb_strlen does not validate that the string is valid in the encoding it uses when computing the length, while iconv_strlen does. So passing an invalid UTF-8 string produces different behavior in the polyfill than in the extension. To have an actual polyfill, we would have to implement a fallback for the cases where iconv fails.
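One possible shape for such a fallback is sketched below. It is not the polyfill's implementation: the function name lenient_utf8_strlen is hypothetical, and the counting rule (each well-formed multibyte sequence counts as one character, every other byte counts as one character) is an assumption that may not match mbstring for every malformed input. The pattern is also approximate, since it does not reject overlong forms or surrogates.

```php
<?php
// Hypothetical fallback: use iconv when it accepts the input, otherwise
// count byte-wise with a permissive pattern.
function lenient_utf8_strlen(string $s): int
{
    $length = @iconv_strlen($s, 'UTF-8');
    if (false !== $length) {
        return $length;
    }

    // No /u modifier, so PCRE works on raw bytes. At each position the
    // multibyte alternatives are tried first; a byte that starts none of
    // them is consumed on its own by the final ".".
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'
        . '|[\xE0-\xEF][\x80-\xBF]{2}'
        . '|[\xF0-\xF4][\x80-\xBF]{3}'
        . '|./s';

    $count = preg_match_all($pattern, $s);

    // If PCRE itself fails, fall back to the byte length rather than
    // letting false become 0.
    return false === $count ? strlen($s) : $count;
}

var_dump(lenient_utf8_strlen(chr(254)));
```

Under this rule the lone byte from the report is counted as one character, which is the result the reporter saw from the extension; whether the two agree on other malformed inputs would need to be checked against mbstring directly.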