CDATA is stripped

jsomers opened this issue 6 years ago

Lambda Soup seems to have no notion of CDATA, and therefore ends up stripping its contents. For instance, the following code:

printf
  "%s"
  ("<div>Hi<![CDATA[[Something or other]]></div>"
   |> Lambda_soup.Soup.parse
   |> Lambda_soup.Soup.to_string);

results in:

<div>Hi</div>

But I would have expected:

<div>Hi Something or other</div>

or something similar.

Anton Bachin · Answer 1 · Fri Sep 21 2018 14:27:27 GMT+0800 (China Standard Time)

@jsomers, CDATA is not allowed in HTML5. It is only allowed in "foreign elements" (SVG, MathML, etc.) embedded in HTML. If CDATA is found in HTML, the parser is supposed to treat it as a comment. From https://www.w3.org/TR/html52/syntax.html#markup-declaration-open-state (after reading <!):

> Otherwise, if there is an adjusted current node and it is not an element in the HTML namespace and the next seven characters are a case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state.
>
> Otherwise, this is a parse error. Create a comment token whose data is the empty string. Switch to the bogus comment state (don’t consume anything in the current state).
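
To make the quoted rule concrete, here is a minimal sketch of that one decision in plain OCaml. It is not code from Lambda Soup or Markup.ml: the types and the function name are hypothetical, and it assumes the tokenizer has already ruled out `--` (a real comment) and `DOCTYPE` before reaching this step.

```ocaml
(* Simplified model of the "markup declaration open" decision quoted above.
   All names here are hypothetical, for illustration only. *)

type namespace = Html | Svg | MathML

type next_state =
  | Cdata_section  (* contents are kept as character data *)
  | Bogus_comment  (* parse error: contents end up in a comment token *)

(* [rest] is the input immediately after "<!". [adjusted_current_node] is the
   namespace of the element being parsed into, if there is one. *)
let after_markup_declaration_open ~adjusted_current_node rest =
  let marker = "[CDATA[" in
  let n = String.length marker in
  (* Case-sensitive match on the next seven characters. *)
  let has_marker = String.length rest >= n && String.sub rest 0 n = marker in
  match adjusted_current_node with
  | Some (Svg | MathML) when has_marker -> Cdata_section
  | _ -> Bogus_comment
```

With `Some Html` as the current node, as for the `<div>` in the report, the function returns `Bogus_comment`, which is why the CDATA contents do not appear as text.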

James Somers · Answer 2 · Fri Sep 21 2018 22:39:08 GMT+0800 (China Standard Time)

@aantron Ah, maybe my example was misleading, because the real tag that had the CDATA in it was not in the HTML namespace: <ac:plain-text-body><![CDATA[[content was here]]></ac:plain-text-body>.

In this case the characters are being stripped, too, but it sounds like from the spec you've quoted that they're supposed to be consumed in some kind of special mode (and therefore that they should be retrievable somehow)?

Thanks!

Anton Bachin · Answer 3 · Sat Sep 22 2018 02:42:17 GMT+0800 (China Standard Time)

What language is that tag from?

"Standard" HTML5 only allows HTML, SVG, and MathML, does not recognize xmlns declarations, and treats all elements that aren't in <svg> or <math> as generic, unknown HTML elements in the HTML namespace. See the note a bit down in this section: https://www.w3.org/TR/html52/syntax.html#writing-html-documents-elements.

Here's another: https://www.w3.org/TR/html52/syntax.html#cdata-sections. CDATA sections are allowed only in MathML and SVG. Per the previous paragraph, HTML5 has a pretty restricted way of knowing when content is MathML or SVG.

There are a couple more sections about this, but the spec is pretty convoluted and I don't want to dump the whole reasoning out here :)

I don't mean to stick too closely to the spec. It may be worthwhile to support this. Perhaps we can modify the parser so that it takes an optional parameter that enables CDATA in HTML.

Anton Bachin · Answer 4 · Sat Sep 22 2018 02:42:59 GMT+0800 (China Standard Time)

> (and therefore that they should be retrievable somehow)?

Sorry, forgot to reply to this. Yes, they should be available in a comment, but Lambda Soup drops comments :) Perhaps we need to keep comments instead.

James Somers · Answer 5 · Tue Sep 25 2018 02:39:18 GMT+0800 (China Standard Time)

> Yes, they should be available in a comment, but Lambda Soup drops comments :) Perhaps we need to keep comments instead.

That would be great!

> What language is that tag from?

It's emitted by a wiki product called Atlassian Confluence, which is in wide use. Their use of CDATA is definitely bizarre, but it would be great if it were at least possible to get at it via Lambda Soup.

Thanks 😄

Anton Bachin · Answer 6 · Tue Sep 25 2018 03:29:45 GMT+0800 (China Standard Time)

@jsomers Is Atlassian Confluence emitting HTML or XML? If HTML, is it well-formed enough to pass as XML? I ask because if that's the case, it should be possible to process the input as XML using Lambda Soup's underlying parser, with some variation of this: http://aantron.github.io/lambda-soup/#VALfrom_signals. That way, we can avoid adding quirks handling for some app's non-standards-compliant output :)

James Somers · Answer 7 · Wed Sep 26 2018 04:43:24 GMT+0800 (China Standard Time)

@aantron It's technically XML, but they call it "XHTML-based" because it's mostly HTML with some custom (non-spec-compliant) stuff thrown in.

Anton Bachin · Answer 8 · Wed Sep 26 2018 05:01:28 GMT+0800 (China Standard Time)

@jsomers, that's good news, and I'm glad Atlassian documents what format they actually produce. You should be able to parse it, with CDATA, by

let soup =
  Markup.string your_xml_string |> Markup.parse_xml |> Markup.signals |> Soup.from_signals in
...

The resulting value will be a Soup.(soup node), the same type as if you had used Soup.parse your_xml_string, except this code parses the string as XML instead of HTML5, so it should be more appropriate to the inputs you have.
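
For reference, a slightly fuller sketch of the same pipeline, wrapped in a helper and applied to the tag from earlier in the thread. It assumes the `markup` and `lambdasoup` packages are installed; `soup_of_xml` and `sample` are hypothetical names chosen for illustration, and how the undeclared `ac:` prefix is treated depends on Markup.ml's XML parser.

```ocaml
(* Requires the markup and lambdasoup packages. *)

(* Hypothetical helper: parse a string as XML rather than HTML5 and build a
   Lambda Soup document from the resulting signal stream. *)
let soup_of_xml xml =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Soup.from_signals

let () =
  (* The Confluence-style tag quoted earlier in this thread. *)
  let sample =
    "<ac:plain-text-body><![CDATA[[content was here]]></ac:plain-text-body>"
  in
  let soup = soup_of_xml sample in
  (* Collect every text node in the document and print the concatenation.
     With XML parsing, the CDATA contents should appear among them. *)
  soup |> Soup.texts |> String.concat "" |> print_endline
```

`Soup.texts` is used here instead of a CSS selector so the sketch does not depend on how namespaced element names are exposed to selectors.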

James Somers · Answer 9 · Fri Sep 28 2018 01:00:16 GMT+0800 (China Standard Time)

@aantron, got it, that works. Thanks very much! Feel free to close the issue.

Anton Bachin · Answer 10 · Fri Sep 28 2018 01:01:21 GMT+0800 (China Standard Time)

Excellent!