Wrong charset breaks parsing

Question

raxbg opened this issue 6 years ago

I know there are already some other issues related to charsets, but I ran into a case where the parser simply could not get past a comment block containing multibyte characters. I did not know the correct charset, but that was a separate problem.

Do you think something like raxbg@33b4306 will work in all cases? It worked in my case, and performance is much better than with mb_substr(). In fact, performance does not seem to be affected by this change at all.

Ivailo Hristov · Answer 1 · Wed Sep 19 2018 19:05:10 GMT+0800 (China Standard Time)

I will be closing this. The method mentioned above has issues. It is much better to use mb_convert_encoding() to convert from whatever the source encoding is to UTF-8, and then run the parser on the converted string.
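A minimal sketch of the approach described above, assuming the source encoding is already known. The helper name `toUtf8()` and the sample stylesheet are made up for illustration; only `mb_convert_encoding()` and `mb_check_encoding()` come from PHP's mbstring extension. The parser call is left as a comment because it depends on the library version in use.

```php
<?php
// Hypothetical helper (not part of any library): convert CSS source
// from a known encoding to UTF-8 before handing it to the parser.
function toUtf8(string $css, string $sourceEncoding): string
{
    // mb_convert_encoding(string, to_encoding, from_encoding)
    return mb_convert_encoding($css, 'UTF-8', $sourceEncoding);
}

// Sample input: an ISO-8859-1 byte (0xE9, "é") inside a comment block.
$latin1Css = "/* caf\xE9 */ a { color: red; }";
$utf8Css = toUtf8($latin1Css, 'ISO-8859-1');

// The converted string is valid UTF-8 and can now be passed to the parser.
var_dump(mb_check_encoding($utf8Css, 'UTF-8'));
```

The conversion happens once, up front, so the parser itself only ever has to deal with UTF-8 input.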

Raphael Schweikert · Answer 2 · Wed Sep 19 2018 19:07:06 GMT+0800 (China Standard Time)

Ok, thanks. I hope to revive #116, which does exactly that IIRC…

Ivailo Hristov · Answer 3 · Wed Sep 19 2018 19:41:50 GMT+0800 (China Standard Time)

To be honest, I would be afraid to merge this PR in my production environment now that I have seemingly working charset detection, mostly because everything seems to work pretty well for UTF-8 encoded strings. Converting the source to UTF-8 beforehand seems to be enough. If @skodak is willing to check the changes against the latest version, that would be okay with me, but simply merging/rebasing the proposed changes seems scary at this point 😛
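For readers wondering what "converting the source to UTF-8 beforehand" could look like when the charset is not known, here is a rough sketch. It is not the detection code referred to in this thread, which is not shown. The helper name `ensureUtf8()` and the `Windows-1252` fallback are assumptions for illustration: the input is kept as-is when it is already valid UTF-8, and otherwise converted from a fallback encoding that the caller has to pick.

```php
<?php
// Hypothetical helper: leave valid UTF-8 untouched, otherwise convert
// from a caller-supplied fallback encoding (an assumption, not detection).
function ensureUtf8(string $css, string $fallbackEncoding = 'Windows-1252'): string
{
    // Valid UTF-8 input needs no conversion.
    if (mb_check_encoding($css, 'UTF-8')) {
        return $css;
    }

    // Not valid UTF-8: treat the bytes as the fallback encoding.
    return mb_convert_encoding($css, 'UTF-8', $fallbackEncoding);
}

$alreadyUtf8 = "/* caf\xC3\xA9 */ a { color: red; }";
$legacy      = "/* caf\xE9 */ a { color: red; }";

// Both results are valid UTF-8 and can be passed to the parser.
$first  = ensureUtf8($alreadyUtf8);
$second = ensureUtf8($legacy);
```

The fallback is only a guess. If the real encoding differs, the result will still be valid UTF-8 but may contain the wrong characters, so a charset declared by the stylesheet or the HTTP response is the better source when one is available.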