Wrong charset breaks parsing
raxbg opened this issue · comments
I know there are some other issues related to charsets already, but I had an issue where the parser was simply not able to get past a comment block containing multibyte characters. I did not know the correct charset, but that was another issue.
Do you think that something like this will work in all cases raxbg@33b4306 ? It worked in my case and performance is also much better than using mb_substr(). Actually, performance does not seem to be affected by this change.
I will be closing this. The method mentioned above has issues. It is much better to use mb_convert_encoding() to convert from whatever the source encoding is to UTF-8 and then use the parser.
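For anyone landing here later, a minimal sketch of that approach. The helper name toUtf8() and the ISO-8859-1 fallback are assumptions made for illustration, not part of the parser; mb_check_encoding() and mb_convert_encoding() are the standard mbstring functions.

```php
<?php
// Hypothetical helper (not part of the parser): normalise CSS source to
// UTF-8 before parsing. If the caller knows the charset, pass it in.
function toUtf8(string $css, ?string $sourceCharset = null): string
{
    if ($sourceCharset === null) {
        // Input that is already valid UTF-8 is returned unchanged.
        if (mb_check_encoding($css, 'UTF-8')) {
            return $css;
        }
        // Assumption: fall back to ISO-8859-1 when the charset is unknown.
        // This is a guess, not detection; adjust it for your own data.
        $sourceCharset = 'ISO-8859-1';
    }

    return mb_convert_encoding($css, 'UTF-8', $sourceCharset);
}

// A comment containing a Latin-1 encoded "é" (byte 0xE9).
$latin1 = "/* caf\xE9 */ a { color: red; }";
$utf8 = toUtf8($latin1, 'ISO-8859-1');

// $utf8 can now be handed to the parser as a UTF-8 string.
```

The fallback only matters when the real charset is unknown, which was the situation described above. If the stylesheet declares its charset, passing that value explicitly is the safer option.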
Ok, thanks. I hope to revive #116, which does exactly that IIRC…
To be honest, I would be afraid to merge this PR into my production environment now that I have seemingly working charset detection, mostly because everything seems to be working pretty well for UTF-8 encoded strings. Converting the source to UTF-8 beforehand seems to be enough. If @skodak is willing to check the changes against the latest version, that will be okay with me, but simply merging/rebasing the proposed changes seems scary at this point 😛