aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page: https://aantron.github.io/lambdasoup

CDATA is stripped

jsomers opened this issue

Lambda Soup seems to have no notion of CDATA, and therefore ends up stripping its contents. For instance the following code:

printf
  "%s"
  ("<div>Hi<![CDATA[[Something or other]]></div>"
   |> Lambda_soup.Soup.parse
   |> Lambda_soup.Soup.to_string);

results in:

<div>Hi</div>

But I would have expected:

<div>Hi Something or other</div>

or something similar.

@jsomers, CDATA is not allowed in HTML5. It is only allowed in "foreign elements" (SVG, MathML, etc.) embedded in HTML. If CDATA is found in HTML, the parser is supposed to treat it as a comment. From https://www.w3.org/TR/html52/syntax.html#markup-declaration-open-state (after reading <!):

Otherwise, if there is an adjusted current node and it is not an element in the HTML namespace and the next seven characters are a case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state.

Otherwise, this is a parse error. Create a comment token whose data is the empty string. Switch to the bogus comment state (don’t consume anything in the current state).
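
As a rough illustration of the two quoted rules, here is a minimal sketch in plain OCaml. It is not Lambda Soup's or Markup.ml's actual code; the type and function names are hypothetical, and it models only the two branches quoted above (the spec's earlier branches for comments and DOCTYPE are left out).

```ocaml
(* Hypothetical model of the tokenizer's choice after reading "<!".
   Only the two branches quoted from the spec are covered. *)
type namespace = Html | Svg | MathML

type next_state =
  | Cdata_section   (* "[CDATA[" seen inside foreign (non-HTML) content *)
  | Bogus_comment   (* everything else, including "[CDATA[" in HTML *)

(* True if [s] begins with [prefix]; the comparison is case-sensitive. *)
let starts_with ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* [adjusted_current_node] is None when there is no such node. *)
let after_markup_declaration_open ~adjusted_current_node rest =
  match adjusted_current_node with
  | Some (Svg | MathML) when starts_with ~prefix:"[CDATA[" rest ->
    Cdata_section
  | _ ->
    Bogus_comment
```

Under this model, `<![CDATA[` directly inside a `<div>` falls through to `Bogus_comment`, which matches the behavior described in this reply.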

@aantron Ah, maybe my example was misleading, because the real tag that had the CDATA in it was not in the HTML namespace: <ac:plain-text-body><![CDATA[[content was here]]></ac:plain-text-body>.

In this case the characters are being stripped, too, but it sounds like from the spec you've quoted that they're supposed to be consumed in some kind of special mode (and therefore that they should be retrievable somehow)?

Thanks!

What language is that tag from?

"Standard" HTML5 only allows HTML, SVG, and MathML, does not recognize xmlns declarations, and treats all elements that aren't in <svg> or <math> as generic, unknown HTML elements in the HTML namespace. See the note a bit down in this section: https://www.w3.org/TR/html52/syntax.html#writing-html-documents-elements.

Here's another: https://www.w3.org/TR/html52/syntax.html#cdata-sections, which says CDATA sections are allowed only in MathML and SVG. Per the previous paragraph, HTML5 has a pretty restricted way of knowing when content is MathML or SVG.

There are a couple more sections about this, but the spec is pretty convoluted and I don't want to dump the whole reasoning out here :)

I don't mean to stick too closely to the spec. It may be worthwhile to support this. Perhaps we can modify the parser so that it takes an optional parameter that enables CDATA in HTML.

> (and therefore that they should be retrievable somehow)?

Sorry, forgot to reply to this. Yes, they should be available in a comment, but Lambda Soup drops comments :) Perhaps we need to keep comments instead.

> Yes, they should be available in a comment, but Lambda Soup drops comments :) Perhaps we need to keep comments instead.

That would be great!

> What language is that tag from?

It's emitted by a wiki product called Atlassian Confluence, which is in wide use. Their use of CDATA is definitely bizarre, but it would be great if it were at least possible to get at it via Lambda Soup.

Thanks 😄

@jsomers Is Atlassian Confluence emitting HTML or XML? If HTML, is it well-formed enough to pass as XML? I ask because if that's the case, it should be possible to process the input as XML using Lambda Soup's underlying parser, with some variation of this: http://aantron.github.io/lambda-soup/#VALfrom_signals. That way, we can avoid adding quirks handling for some app's non-standard-compliant output :)

@aantron It's technically XML, but they call it "XHTML-based" because it's mostly HTML with some custom (non-spec-compliant) stuff thrown in.

@jsomers, that's good news, and I'm glad Atlassian documents what format they actually produce. You should be able to parse it, with CDATA, using something like:

let soup =
  Markup.string your_xml_string |> Markup.parse_xml |> Markup.signals |> Soup.from_signals in
...

The resulting value will be a Soup.(soup node), the same type as if you had used Soup.parse your_xml_string, except this code parses the string as XML instead of HTML5, so it should be more appropriate to the inputs you have.
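
For anyone landing here later, a slightly fuller sketch of the same approach follows. It assumes the `markup` and `lambdasoup` opam packages are available; the input string and the element names `doc` and `item` are made up for illustration and are not Confluence's real markup.

```ocaml
(* Made-up XML input with a CDATA section. *)
let xml = "<doc><item><![CDATA[content was here]]></item></doc>"

(* Parse as XML with Markup.ml, then build a Lambda Soup tree from the
   resulting signal stream instead of calling Soup.parse (HTML5). *)
let soup =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Soup.from_signals

(* Collect the text of all text nodes in the document. With XML parsing,
   the CDATA content should show up here as ordinary text. *)
let text = soup |> Soup.texts |> String.concat ""

let () = print_endline text
```

The only difference from `Soup.parse` is which Markup.ml parser feeds `Soup.from_signals`, so the rest of the Lambda Soup API (selectors, traversal, `to_string`) can be used on `soup` as usual.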

@aantron, got it, that works. Thanks very much! Feel free to close the issue.

Excellent!