How to dump out CDATA element

Question

la10736 opened this issue 8 years ago · comments

Is there some way to prevent < > substitution for CDATA text element?

add a text node like <![CDATA[Data]]> will give:

<?xml version='1.0'?><FILE><CONTENT_TYPE>&lt;![CDATA[Data]]&gt;</CONTENT_TYPE></FILE>

In this case I'll get back <![CDATA[Data]]> as CONTENT_TYPE content instead of simply Data.

Can I work around this behaviour, or where should I look to implement it?

Thanks

Jake Goulding · Answer 1 · Fri Nov 11 2016 02:01:31 GMT+0800 (China Standard Time)

May I ask why the encoding of this matters? In the past, when people have asked for similar features, it's because they are dealing with some other aspect of a pipeline that does not properly handle XML or has the wrong amount of encoding. For example, they might be using a regex to look for strings.

for CDATA text element?

There is no such thing as a "CDATA text element". CDATA is an artifact of encoding an XML structure to a text form. When you say you want to "add a text node like <![CDATA[Data]]>", that seems very suspicious because you are trying to deal with the encoded form of the data at a higher level.

instead of simply Data.

That seems to indicate that you want to just add a text node with the content "Data".

Michele d'Amico · Answer 2 · Fri Nov 11 2016 04:38:24 GMT+0800 (China Standard Time)

Ok ... you busted me :)... I'm dealing with a legacy project that uses CDATA almost everywhere. My XML messages can contain a lot of nodes that embed raw data and HTML in the text nodes.

Unfortunately, I also need to send messages back, not just parse the incoming ones.

Jake Goulding · Answer 3 · Fri Nov 11 2016 06:04:32 GMT+0800 (China Standard Time)

My XML messages can contain a lot of nodes that embed raw data and HTML in the text nodes.

OK, that's one of the few use-cases where this type of output can occur. Essentially, there are XML structures that have been encoded as text, and then that text has been placed in another XML structure which was then encoded as text.

However, that's exactly why the < has to be encoded as &lt;. Given the encoded XML:

<CONTENT_TYPE>&lt;![CDATA[Data]]&gt;</CONTENT_TYPE>

this corresponds to a Rust value like

Element {
    name: "CONTENT_TYPE",
    children: vec![Text("<![CDATA[Data]]>")],
}

However, if the encoded XML were:

<CONTENT_TYPE><![CDATA[Data]]></CONTENT_TYPE>

Then the Rust value would look like:

Element {
    name: "CONTENT_TYPE",
    children: vec![Text("Data")],
}

Said another way, these are very different representations, and you cannot simply toggle some output flag to switch between them.
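To make the difference concrete, here is a minimal sketch using a toy model. The `Node` and `Element` types and the `escape_text` and `serialize` functions are hypothetical stand-ins written for this illustration, not the library's actual API. It shows that a text node holding the literal string `<![CDATA[Data]]>` and a text node holding `Data` serialize to different documents.

```rust
// Hypothetical toy model for illustration only; not the library's real types.
enum Node {
    Text(String),
}

struct Element {
    name: String,
    children: Vec<Node>,
}

/// Escape the characters that a serializer must not write literally in text content.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            other => out.push(other),
        }
    }
    out
}

/// Serialize an element whose children are all text nodes.
fn serialize(element: &Element) -> String {
    let mut out = format!("<{}>", element.name);
    for child in &element.children {
        match child {
            // The text value is data, so every special character is escaped.
            Node::Text(text) => out.push_str(&escape_text(text)),
        }
    }
    out.push_str(&format!("</{}>", element.name));
    out
}
```

With this model, the value `Text("<![CDATA[Data]]>")` can only round-trip if its angle brackets are escaped; writing them raw would change the value a parser reads back to `Data`.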

The main option that could make sense in this area would be smarter analysis of when to use CDATA in output. For example, if we have to escape < as &lt;, that's 3 extra bytes per escaped character. If we escape as <![CDATA[<]]>, that's 12 extra bytes, but only paid once.

We could also have an option that always used CDATA. Neither of these options seems to fit your problem though.
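The byte arithmetic above can be sketched as follows. The names `entity_overhead`, `CDATA_OVERHEAD`, and `cdata_is_smaller` are hypothetical and exist only for this illustration; nothing like them is part of the library. The sketch also assumes that `&` and `>` are entity-escaped alongside `<`.

```rust
/// Fixed cost of one CDATA section: "<![CDATA[" (9 bytes) plus "]]>" (3 bytes).
const CDATA_OVERHEAD: usize = "<![CDATA[".len() + "]]>".len();

/// Extra bytes needed to entity-escape `text` (hypothetical helper).
fn entity_overhead(text: &str) -> usize {
    text.chars()
        .map(|c| match c {
            '<' | '>' => 3, // "&lt;" and "&gt;" take 4 bytes instead of 1
            '&' => 4,       // "&amp;" takes 5 bytes instead of 1
            _ => 0,
        })
        .sum()
}

/// Returns true when a single CDATA section would be shorter than entity escaping.
fn cdata_is_smaller(text: &str) -> bool {
    // "]]>" cannot appear inside one CDATA section, so this sketch
    // conservatively reports false for such text.
    !text.contains("]]>") && entity_overhead(text) > CDATA_OVERHEAD
}
```

Under these assumptions, CDATA only wins once the text contains five or more angle brackets, which is why a short value like `<one>` would still be entity-escaped.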

Michele d'Amico · Answer 4 · Fri Nov 11 2016 06:42:45 GMT+0800 (China Standard Time)

OK, I'll think about that. Maybe I'll try to introduce a new node type like RawText that wraps text in a CDATA section, so I can choose what is CDATA and what is not. Thanks!

Jake Goulding · Answer 5 · Fri Nov 11 2016 06:58:56 GMT+0800 (China Standard Time)

I still don't believe we've really gotten to the root of the problem you are trying to solve. Are you interacting with a remote system that does not understand how to decode &lt; in a text node? Perhaps you could restate your problem in the form of some input text-encoded XML that you expect to receive and what you expect the value to be when accessed in Rust? In addition / alternatively, you could provide what kind of Rust code you expect to write with what kind of text-encoded XML you expect to generate.

Referring back to your original comment:

add a text node like <![CDATA[Data]]> will give:
<?xml version='1.0'?><FILE><CONTENT_TYPE>&lt;![CDATA[Data]]&gt;</CONTENT_TYPE></FILE>
In this case I'll get back <![CDATA[Data]]> as CONTENT_TYPE content instead of simply Data.

If you want to get back Data, you should set the value as Data. Adding a value and getting the same value back is exactly what should be expected! Essentially, any code that uses any XML library should never concern itself with encoding details like CDATA or &lt;.

Chances are very good that I will not accept a patch that allows the library to produce broken output, or otherwise double-encodes or double-decodes the text form.

I am unlikely to accept a patch that adds a new node type or a flag to every text element without very good reasons. This would add overhead that every user of the library would be expected to have to deal with.

Controlling at output time whether text nodes prefer CDATA, discourage CDATA, or "smartly" use CDATA would be an acceptable submission.
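One possible shape for such an output-time option is sketched below. Everything here is an assumption made for illustration: the `CdataPolicy` enum, its variant names, and the `write_text` function do not exist in the library, and the `Smart` threshold is only a rough stand-in for a real size comparison. The point is that the choice lives in the writer, while the text node's value stays the same.

```rust
/// Hypothetical output-time policy; not part of the library.
#[derive(Clone, Copy)]
enum CdataPolicy {
    Prefer,
    Discourage,
    Smart,
}

/// Entity-escape text content ('&' must be replaced first).
fn escape_text(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Wrap text in CDATA, splitting any "]]>" so it cannot end the section early.
fn wrap_cdata(text: &str) -> String {
    format!("<![CDATA[{}]]>", text.replace("]]>", "]]]]><![CDATA[>"))
}

/// Encode one text node's value according to the chosen policy.
fn write_text(text: &str, policy: CdataPolicy) -> String {
    let special = text.chars().filter(|&c| matches!(c, '<' | '>' | '&')).count();
    let use_cdata = match policy {
        // Use CDATA whenever anything would otherwise need escaping.
        CdataPolicy::Prefer => special > 0,
        // Never use CDATA.
        CdataPolicy::Discourage => false,
        // Rough heuristic: five escapes cost at least 15 bytes, more than
        // the 12 bytes of CDATA wrapping.
        CdataPolicy::Smart => special > 4,
    };
    if use_cdata {
        wrap_cdata(text)
    } else {
        escape_text(text)
    }
}
```

Whichever policy is chosen, a conforming parser reads back the same text value, so the option would not add a new node type or a per-node flag.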

Michele d'Amico · Answer 6 · Fri Nov 11 2016 17:05:26 GMT+0800 (China Standard Time)

OK, I'll consider writing a smart version, but I took a look at some other libraries, like the standard Java one https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html and Python's lxml http://lxml.de/api.html#cdata, that support CDATA nodes. So maybe introducing a new node type is not such a strange way to solve it.

Christopher Serr · Answer 7 · Wed Jan 04 2017 00:29:10 GMT+0800 (China Standard Time)

Looks like I need to output data encoded as CDATA as well and there doesn't seem to be any way to do this with this library currently (unless I'm missing something). Basically the format I need to write to encodes small icons into the XML as base-64 encoded cdata sections.

Jake Goulding · Answer 8 · Wed Jan 04 2017 03:25:03 GMT+0800 (China Standard Time)

I need to output data encoded as CDATA as well

I still am unable to understand this. If you have a text node with XML special characters, the library will escape them correctly.

These two are exactly the same when parsed by an XML processor:

<foo>&lt;one&gt;</foo>

<foo><![CDATA[<one>]]></foo>

Although currently we only output the first one.
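The equivalence of the two forms can be checked with a deliberately tiny decoder. The function `decode_text_content` below is a hypothetical sketch that handles only CDATA sections and the `&lt;`, `&gt;`, and `&amp;` entities; it is not a real XML parser and is not how the library is implemented.

```rust
/// Decode the text content of an element (hypothetical, heavily simplified).
fn decode_text_content(encoded: &str) -> String {
    let mut out = String::new();
    let mut rest = encoded;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix("<![CDATA[") {
            // Everything up to the closing "]]>" is taken literally.
            let end = after.find("]]>").unwrap_or(after.len());
            out.push_str(&after[..end]);
            rest = after.get(end + 3..).unwrap_or("");
        } else if let Some(after) = rest.strip_prefix("&lt;") {
            out.push('<');
            rest = after;
        } else if let Some(after) = rest.strip_prefix("&gt;") {
            out.push('>');
            rest = after;
        } else if let Some(after) = rest.strip_prefix("&amp;") {
            out.push('&');
            rest = after;
        } else {
            // Any other character is copied through unchanged.
            let c = rest.chars().next().unwrap();
            out.push(c);
            rest = &rest[c.len_utf8()..];
        }
    }
    out
}
```

Feeding it `&lt;one&gt;` and `<![CDATA[<one>]]>` produces the same string in both cases, which is the sense in which the two encodings are interchangeable to a correct consumer.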

Are you dealing with a broken parser on the other side that requires CDATA?

XML as base-64 encoded cdata sections.

Base64 is not a part of XML. Your application needs to perform this transformation. Then create a text node with the result. Base64 won't need to be CDATA-escaped, so you'll just end up with <element>BASE64==</element>.
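As a sketch of that workflow, the standard-library-only code below encodes bytes as Base64 and places the result directly in an element. The functions `base64_encode`, `needs_xml_escaping`, and `element_with_base64` are hypothetical helpers written for this illustration; a real application would more likely use an existing Base64 crate and the XML library's own element and text node API.

```rust
/// Standard Base64 alphabet; it contains none of '<', '>' or '&'.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode bytes as padded Base64 (hypothetical helper for illustration).
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// True if the text contains a character that must be escaped in XML text content.
fn needs_xml_escaping(text: &str) -> bool {
    text.contains(|c: char| matches!(c, '<' | '>' | '&'))
}

/// Build the text form of an element holding Base64 data.
fn element_with_base64(bytes: &[u8]) -> String {
    format!("<element>{}</element>", base64_encode(bytes))
}
```

Because the Base64 alphabet never produces `<`, `>`, or `&`, the encoded text passes through unchanged whether the writer uses entities or CDATA.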