ceee / ReadSharp

:rooster: Extract meaningful website contents using a port of NReadability

Encoding

mracis opened this issue · comments

Does the Reader detect the encoding of the article?
It always shows "�" in the text (ReadSharp 6.0.0.0), for example with this source: http://www.jn.pt/PaginaInicial/Politica/Interior.aspx?content_id=3996648&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+JN-ULTIMAS+%28JN+-+Ultimas%29

HTTP header: content-type="ISO-8859-1"
HTML metadata: content-type="UTF-8"

I think it is "Encondings.ISO88591" that is changing the content.

commented

I can confirm that it doesn't work and will look into this issue!

commented

The problem with the link you provided is that the HTTP and HTML encodings differ.
I already handled this case in a previous version and decided to use the HTML encoding, as I think it's more appropriate.
For your link, the HTML encoding is wrong, and the ISO-8859-1 encoding (taken from the HTTP headers) is the correct one.

That's an issue caused by the website, not by ReadSharp.
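
To illustrate where the "�" comes from, here is a minimal Python sketch (not ReadSharp code). It assumes the page body is really ISO-8859-1, as the HTTP header says, and uses the sample word "Política" purely as an illustration.

```python
# Illustration only: the sample text is an assumption, not taken from the page.
original = "Política"

# Bytes as the server would send them if the body is really ISO-8859-1.
raw = original.encode("iso-8859-1")

# Decoding with the charset claimed by the HTML metadata (UTF-8) hits an
# invalid byte sequence at "í"; errors="replace" substitutes U+FFFD ("�").
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the charset from the HTTP header recovers the text.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The accented character is a single byte in ISO-8859-1, and that byte does not form a valid UTF-8 sequence with the byte that follows it, so a UTF-8 decoder replaces it.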

What have I done?

As I cannot know which encoding is correct, I'll keep preferring the HTML encoding by default.
But I've added a new toggle, PreferHTMLEncoding, to the ReadOptions; set it to false and your link will be decoded correctly.

So you can decide for yourself which one to use, or set it to false only when parsing pages from the domain jn.pt.
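
The behaviour of such a toggle can be sketched in Python as follows. This is only an illustration of the idea, not ReadSharp's implementation: the function names, the `prefer_html_encoding` parameter, and the UTF-8 fallback are all assumptions made for the example.

```python
from typing import Optional


def choose_encoding(
    http_charset: Optional[str],
    html_charset: Optional[str],
    prefer_html_encoding: bool = True,
) -> str:
    """Pick a charset when the HTTP header and HTML metadata may disagree.

    Hypothetical helper: the preferred source wins if it is present,
    otherwise the other source is used. Falling back to UTF-8 when
    neither is given is an assumption of this sketch.
    """
    if prefer_html_encoding:
        first, second = html_charset, http_charset
    else:
        first, second = http_charset, html_charset
    return first or second or "utf-8"


def decode_page(
    raw: bytes,
    http_charset: Optional[str],
    html_charset: Optional[str],
    prefer_html_encoding: bool = True,
) -> str:
    """Decode the page body with the chosen charset (hypothetical helper)."""
    encoding = choose_encoding(http_charset, html_charset, prefer_html_encoding)
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(encoding, errors="replace")


# A body that is really ISO-8859-1 while the HTML metadata claims UTF-8.
body = "Política".encode("iso-8859-1")
print(decode_page(body, "ISO-8859-1", "UTF-8"))
print(decode_page(body, "ISO-8859-1", "UTF-8", prefer_html_encoding=False))
```

With the default the HTML charset wins and the accented character is lost; with the flag set to false the HTTP header charset is used and the text decodes cleanly.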

I have included this use case in the tests.

Bye Tobi

Thank you very much. Nice library.
Bye Ricardo