aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page: https://aantron.github.io/lambdasoup

Non-breaking space Unicode byte A0 gets mangled

arkdae opened this issue · comments

commented

I am using Lambda Soup by way of the static site generator Soupault, with Kramdown converting plain text into HTML. Most of the output uses HTML entities, for example for smart quotes and ellipses. However, non-breaking spaces are output as a single byte with the hex value A0 instead of &nbsp;. This byte is getting mangled, as seen here:

utop # Soup.parse "<html><body>Test\xA0test</body></html>" |> Soup.pretty_print;;
- : string =
"<html>\n <head></head>\n <body>\n  Test�test\n </body>\n</html>\n"

The issue is that the default input character set for HTML is UTF-8.

0xA0 is the numeric value of a non-breaking space character, but its UTF-8 encoding is C2 A0.

A0 on its own is not valid UTF-8. Lambda Soup (actually, its underlying parser Markup.ml) is reading the A0 and replacing it with the Unicode replacement character, with numeric value 0xFFFD and UTF-8 encoding EF BF BD. This behavior is correct according to the HTML spec.
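To make the byte values above concrete, here is a minimal sketch that prints the UTF-8 encoding of a code point using only the OCaml standard library. The helper name `utf8_hex` is made up for this illustration and is not part of Lambda Soup or Markup.ml.

```ocaml
(* Encode one Unicode code point as UTF-8 and render its bytes in hex.
   Standard library only; utf8_hex is a hypothetical helper name. *)
let utf8_hex (code_point : int) : string =
  let buf = Buffer.create 4 in
  Buffer.add_utf_8_uchar buf (Uchar.of_int code_point);
  Buffer.contents buf
  |> String.to_seq
  |> Seq.map (fun c -> Printf.sprintf "%02X" (Char.code c))
  |> List.of_seq
  |> String.concat " "

let () =
  (* Non-breaking space: prints "U+00A0 -> C2 A0" *)
  Printf.printf "U+00A0 -> %s\n" (utf8_hex 0x00A0);
  (* Replacement character: prints "U+FFFD -> EF BF BD" *)
  Printf.printf "U+FFFD -> %s\n" (utf8_hex 0xFFFD)
```

So the single byte A0 in the input is neither of these sequences, which is why the parser substitutes EF BF BD for it.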

Can you configure Kramdown to either emit HTML entities, or to emit UTF-8?

Alternatively, Lambda Soup and Markup.ml can be used to read input in a different encoding. A bare A0 byte is not valid in any Unicode encoding, but I believe it is valid ISO 8859-1. Using that encoding is not recommended, but if Kramdown cannot be told to write Unicode, you could try replacing Soup.parse your_input with

your_input
|> Markup.string
|> Markup.parse_html ~encoding:Markup.Encoding.iso_8859_1
|> Markup.signals
|> Soup.from_signals
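Another option, if the input really is ISO 8859-1, might be to transcode it to UTF-8 before parsing. The sketch below uses only the standard library and relies on the fact that each ISO 8859-1 byte value equals the Unicode code point of its character. The function name `latin1_to_utf8` is an assumption made for this example, not an existing Lambda Soup or Markup.ml function.

```ocaml
(* Transcode an ISO 8859-1 string to UTF-8.
   Each byte 0x00..0xFF is treated as the code point with the same value.
   latin1_to_utf8 is a hypothetical helper, not a library function. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

let () =
  let converted = latin1_to_utf8 "<html><body>Test\xA0test</body></html>" in
  (* The lone A0 byte becomes the two-byte sequence C2 A0. *)
  assert (converted = "<html><body>Test\xC2\xA0test</body></html>")
```

The result could then be passed to Soup.parse as usual. This only gives correct output when the input is known to be ISO 8859-1; running it on text that is already UTF-8 would double-encode every non-ASCII character.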
commented

I guess this must be a Kramdown bug, then, because it already emits HTML entities for single and double quotes and ellipses, among other things. I was surprised that it produced this single A0 byte when everything else in the output is plain 7-bit ASCII.

I've just been piping the output from Kramdown through a sed script which converts the A0 byte into &nbsp;.
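For anyone who would rather do that substitution in OCaml than in sed, a rough equivalent is sketched below. The sed script itself is not shown in this thread, so this is a reconstruction of the idea rather than a translation, and `escape_nbsp` is a made-up name.

```ocaml
(* Replace every raw A0 byte with the entity "&nbsp;".
   Only safe when the input is ASCII or ISO 8859-1: in UTF-8 input, A0 can
   also appear as a continuation byte inside a multi-byte character.
   escape_nbsp is a hypothetical helper name. *)
let escape_nbsp (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if c = '\xA0' then Buffer.add_string buf "&nbsp;"
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf

let () =
  print_endline (escape_nbsp "Test\xA0test")  (* prints "Test&nbsp;test" *)
```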

Update:
I found another solution, run Kramdown like so:

kramdown --entity-output :symbolic

I still say this is a bug in Kramdown, however, because the non-breaking space is generated in footnotes, and in those same footnotes Kramdown adds another HTML entity, &#8617;, which looks like a carriage return symbol and links back to the origin of the footnote. It doesn't make sense and seems inconsistent to me that I have to add this option for only one place (so far that I have found) to ensure HTML entities are generated instead of higher-valued bytes of an unspecified encoding.

But I'll leave it alone for now.

Final update:
I guess it is not a bug in Kramdown. If I change my environment to LANG=en_US.UTF-8, then Kramdown outputs the two bytes C2 A0. So it was outputting ISO-8859-1 simply because that is what my environment was set to.

Maybe an enhancement to lambdasoup would be to honor the environment's encoding?

Thanks for looking into this!

Maybe an enhancement to lambdasoup would be to honor the environment's encoding?

Strictly speaking, it wouldn't be an enhancement. In the HTML spec, the parsing algorithm does not depend on the user's environment. I also don't think it's something that users expect. The vast majority are using UTF-8, and it would be surprising for their code to behave differently when deployed to another machine, potentially for another user, whose environment happens to be configured differently, even when the input data is exactly the same.

There is an encoding detection procedure, but it works by assuming 7-bit ASCII and trying to find a <meta> tag before restarting parsing, or by looking for Unicode byte order marks. Those are absent in the vast majority of inputs Lambda Soup sees, so, in practice, Lambda Soup assumes UTF-8, though it can be forced to read just about any other encoding.
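As a rough illustration of the byte order mark part of that procedure, here is a simplified sketch. It is not Markup.ml's actual detection code. The names `starts_with` and `encoding_of_bom` are assumptions for this example, and it ignores both the <meta> prescan and UTF-32.

```ocaml
(* Simplified sketch of byte order mark sniffing. NOT Markup.ml's real code.
   starts_with and encoding_of_bom are hypothetical helper names. *)
let starts_with (prefix : string) (s : string) : bool =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Return the encoding implied by a leading BOM, or None if there is none.
   UTF-32 is deliberately not handled here. *)
let encoding_of_bom (s : string) : string option =
  if starts_with "\xEF\xBB\xBF" s then Some "UTF-8"
  else if starts_with "\xFE\xFF" s then Some "UTF-16BE"
  else if starts_with "\xFF\xFE" s then Some "UTF-16LE"
  else None

let () =
  (* The input from this issue has no BOM, so detection finds nothing
     and a parser following the spec would fall back to its default. *)
  assert (encoding_of_bom "<html><body>Test\xA0test</body></html>" = None)
```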