Encoding

Question

mracis opened this issue 10 years ago · comments

Does the Reader detect the encoding of the article?
It always shows "�" in the text (ReadSharp 6.0.0.0), for example with this source: http://www.jn.pt/PaginaInicial/Politica/Interior.aspx?content_id=3996648&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+JN-ULTIMAS+%28JN+-+Ultimas%29

HTTP header: content type = "ISO-8859-1", and
HTML metadata: content-type = "UTF-8"

I think it is "Encondings.ISO88591" that is changing the content.

tobi · Answer 1 · Fri Jul 18 2014 18:02:07 GMT+0800 (China Standard Time)

I can confirm that it doesn't work and will look into this issue!

tobi · Answer 2 · Fri Jul 18 2014 19:49:43 GMT+0800 (China Standard Time)

The problem with the link you provided is that the HTTP and HTML encodings are different.
I already handled this case in a previous version and decided to take the HTML encoding, as I think it's more appropriate.
In your link the HTML encoding is wrong, and the ISO-8859-1 encoding (taken from the HTTP headers) is the correct one.

That's an issue raised by the website, not by ReadSharp.
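
To illustrate why the "�" shows up, here is a minimal sketch in Python (not ReadSharp code, just the same effect reproduced in isolation). It assumes the page bytes are really ISO-8859-1, as the HTTP header says, while the HTML metadata claims UTF-8. The sample word is made up for the demonstration.

```python
# Bytes as a server would send them when the page is really ISO-8859-1.
# "Política" is an arbitrary sample word containing a non-ASCII character.
raw = "Política".encode("iso-8859-1")

# Decoding with the charset the HTML metadata (wrongly) declares.
# The byte 0xED is not valid UTF-8 here, so it becomes U+FFFD ("�").
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the charset from the HTTP header gives the intended text.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The first line printed contains the replacement character, the second one is the original word.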

What have I done?

As I cannot know which encoding is the correct one, I'll keep preferring the HTML encoding by default.
But I've included a new toggle PreferHTMLEncoding in the ReadOptions, which you can set to false, and your link will be decoded correctly.

So you can decide for yourself which one you want to use, or set it to false only when you are parsing from the domain jn.pt.
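
The idea behind the toggle can be sketched like this, again in Python and not taken from the ReadSharp source. The names `pick_encoding`, `decode_page` and `prefer_html_encoding` are hypothetical stand-ins for whatever the library does internally, and the UTF-8 fallback when neither charset is declared is an assumption of this sketch.

```python
def pick_encoding(http_charset, html_charset, prefer_html_encoding=True):
    """Choose a charset from the two declarations; either may be None."""
    if prefer_html_encoding:
        first, second = html_charset, http_charset
    else:
        first, second = http_charset, html_charset
    # Assumption: fall back to UTF-8 when the page declares nothing at all.
    return first or second or "utf-8"


def decode_page(raw, http_charset, html_charset, prefer_html_encoding=True):
    """Decode raw page bytes with the preferred declared charset."""
    charset = pick_encoding(http_charset, html_charset, prefer_html_encoding)
    # Undecodable bytes are replaced with U+FFFD instead of raising.
    return raw.decode(charset, errors="replace")


# Same situation as in the issue: header says ISO-8859-1, HTML says UTF-8.
raw = "Política".encode("iso-8859-1")
default_text = decode_page(raw, "iso-8859-1", "utf-8")
fixed_text = decode_page(raw, "iso-8859-1", "utf-8", prefer_html_encoding=False)
print(default_text)
print(fixed_text)
```

With the default the HTML charset wins and the text is garbled; with the toggle off the HTTP header charset wins and the text comes out intact.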

I have included this use case in the tests.

Bye Tobi

Another Guy · Answer 3 · Tue Jul 29 2014 23:24:44 GMT+0800 (China Standard Time)

Thank you very much. Nice library.
Bye Ricardo