FasterXML / jackson-dataformat-xml

Extension for Jackson JSON processor that adds support for serializing POJOs as XML (and deserializing from XML) as an alternative to JSON

Jackson 2.14 change in XML mixed content behaviour

dan2097 opened this issue · comments

Jackson 2.14 under some circumstances uses an empty key for mixed content even when doing so is unnecessary, which 2.13 did not do. I didn't notice this change in the release notes, so I was wondering whether it was intentional, as subjectively it seems worse.

import java.io.StringReader;

import javax.xml.stream.XMLStreamConstants;

import org.codehaus.stax2.XMLStreamReader2;

import com.ctc.wstx.stax.WstxInputFactory;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class JacksonMixedContentHandlingChange {

  public static void main(String[] args) throws Exception {
    String xml = "<rel-passage><passage><figure>2-3</figure>section 3.3, paragraphs 1-2</passage><passage>section 3.3, last subsection Optimization with Part Segmentation</passage><passage>equations 9-11</passage><passage>section 4.4 and table 4</passage></rel-passage>";
    XMLStreamReader2 reader = (XMLStreamReader2) new WstxInputFactory().createXMLStreamReader(new StringReader(xml));
    XmlMapper xmlMapper = new XmlMapper();
    ArrayNode passages = JsonNodeFactory.instance.arrayNode();
    while (reader.hasNext()) {
      switch (reader.next()) {
      case XMLStreamConstants.START_ELEMENT:
        switch (reader.getLocalName()) {
        case "passage":
          passages.add(xmlMapper.readValue(reader, JsonNode.class));
          break;
        default:
          break;
        }
        break;
      case XMLStreamConstants.END_ELEMENT:
        System.out.println(passages.toPrettyString());
      }
    }
  }
}

Jackson 2.12.1 and Jackson 2.13.5

[ {
  "figure" : "2-3",
  "" : "section 3.3, paragraphs 1-2"
}, "section 3.3, last subsection Optimization with Part Segmentation", "equations 9-11", "section 4.4 and table 4" ]

Jackson 2.14.0 and Jackson 2.14.2

[ {
  "figure" : "2-3",
  "" : "section 3.3, paragraphs 1-2"
}, {
  "" : "section 3.3, last subsection Optimization with Part Segmentation"
}, {
  "" : "equations 9-11"
}, {
  "" : "section 4.4 and table 4"
} ]

The behaviour is also somewhat inconsistent: even with Jackson 2.14, if you map the entire document at once you get:

{
    "passage": [{
            "figure": "2-3",
            "": "section 3.3, paragraphs 1-2"
        }, "section 3.3, last subsection Optimization with Part Segmentation", "equations 9-11", "section 4.4 and table 4"]
}

Quick note: I don't know the answer, but the usage as shown -- iterating over XMLStreamReader and passing sub-trees to ObjectReader -- is not really supported, nor is its behavior defined.
This is because an internal translation occurs between the Stax XMLStreamReader[2] and Jackson's JsonParser API, and state is managed over the course of full databinding of the whole document.

Put another way: reading whole XML document into JsonNode is supported and behavior should remain consistent or improve. Similarly for custom deserializers, reading JsonNode via DeserializationContext.readTree() should be supported. But trying to directly read JsonNodes from XML token stream is not really expected to be used.
If it works, fine, but it is not supported API.

Having said that, you also mentioned that the whole document case is inconsistent. This should not happen since XML mixed content support was added, I think, in 2.12.0. That is, behavior for that case should not have changed 2.12 -> 2.13 -> 2.14.
(but once again, piece-by-piece binding directly from XMLStreamReader is not supported API)

The inconsistency was only as compared to reading the sub-tree, reading the entire document gave the expected JSON in all versions.

iterating over XMLStreamReader, passing sub-trees to ObjectReader -- is not really supported, or behavior defined.

Does that mean there isn't an idiomatic way to convert sections to JSON when streaming over a very large XML document?
For smaller documents I guess you could keep a copy of the original XML in memory and use XMLStreamReader location information to pull out the start/end character offsets of the XML section, then create a new XMLStreamReader over that substring?

@dan2097 Idiomatic usage that might work a bit better would be to use FromXmlParser (a subtype of JsonParser) instead of the underlying XMLStreamReader. This is not guaranteed to be quite as stable as the high-level API, but should be more stable than XMLStreamReader.

Apologies for this being essentially undocumented: I understand that in this case the API does allow this usage without indicating any issues.

The technical challenge here is that there is a multi-level transformation from the XML event sequence (from the Stax API) into a JSON token sequence (Jackson's streaming JsonParser), the latter then being used similarly (ideally identically) across all formats. But with the XML logical model being different from JSON's, certain translations need to be done first, and this is where state-keeping by FromXmlParser matters -- ObjectMapper.readValue() will create a new parser instance, which then does not have the state required for some of the translations (of white space, I think).

So, if at all possible I would try constructing FromXmlParser (exposed as JsonParser) and feeding that to ObjectMapper.readValue().

I hope this is possible, as I agree that the ability to process sub-sections without reading the whole input document is a very useful and important feature.

Do you mean replacing:

     case XMLStreamConstants.START_ELEMENT:
        switch (reader.getLocalName()) {
        case "passage":
          passages.add(xmlMapper.readValue(reader, JsonNode.class));
          break;

with

     case XMLStreamConstants.START_ELEMENT:
        switch (reader.getLocalName()) {
        case "passage":
          FromXmlParser fromXml = new XmlFactory().createParser(reader);
          passages.add(objectMapper.readValue(fromXml, JsonNode.class));
          break;

Both gave the same suboptimal output on v2.14.2 (and conversely both gave the preferable output on v2.13.5).

Unrelatedly, the Javadoc for getStaxReader() on FromXmlParser doesn't seem right as it talks exclusively about writers.

Not quite, I mean constructing FromXmlParser via XmlFactory (accessible from XmlMapper, getFactory() or so) for the given XMLStreamReader. State is handled by the parser that wraps the underlying XMLStreamReader.

So basically only pass in XMLStreamReader once and do not iterate with it over input: FromXmlParser must have full state and control over input.

As I said I am not sure if this works any better, but it would allow keeping state that XML module expects wrt content and massaging it to work with JSON-centric model Jackson has at streaming level.

The new behavior is the expected one at this point: or more specifically, I have no plans to revert to the 2.13 behavior.

Closing.

I couldn't immediately figure out how to get the original behaviour using your suggestion. I ended up implementing a conversion specific to my use case, as on reflection even the original output still wasn't ideal (a key of "" isn't especially intuitive).

Yeah, unfortunately mapping Mixed Content into a framework that is not XML-aware is quite difficult. So custom handling may be necessary.
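
For readers landing here with the same problem, the following is a minimal sketch of what such custom handling could look like. It is not the reporter's actual implementation and does not use Jackson at all: it walks the document with the JDK's own Stax reader and builds plain Java values. The class name MixedContentSketch, the helper names, and the "text" key (used instead of "") are all assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class MixedContentSketch {

  // Hypothetical key for loose text inside mixed content (instead of "").
  static final String TEXT_KEY = "text";

  /**
   * Converts the element whose START_ELEMENT the reader is positioned on.
   * Text-only elements become a String; elements with child elements become
   * a Map, with any surrounding text stored under TEXT_KEY.
   * Simplification: repeated child names overwrite each other.
   */
  static Object readElement(XMLStreamReader reader) throws XMLStreamException {
    Map<String, Object> children = new LinkedHashMap<>();
    StringBuilder text = new StringBuilder();
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        // The recursive call consumes everything up to the child's END_ELEMENT.
        children.put(reader.getLocalName(), readElement(reader));
      } else if (event == XMLStreamConstants.CHARACTERS
          || event == XMLStreamConstants.CDATA) {
        text.append(reader.getText());
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        break; // end of the element this call is responsible for
      }
    }
    String trimmed = text.toString().trim();
    if (children.isEmpty()) {
      return trimmed;
    }
    if (!trimmed.isEmpty()) {
      children.put(TEXT_KEY, trimmed);
    }
    return children;
  }

  /** Streams over the document and converts each "passage" element in turn. */
  static List<Object> readPassages(String xml) throws XMLStreamException {
    XMLStreamReader reader =
        XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
    List<Object> passages = new ArrayList<>();
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "passage".equals(reader.getLocalName())) {
          passages.add(readElement(reader));
        }
      }
    } finally {
      reader.close();
    }
    return passages;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<rel-passage><passage><figure>2-3</figure>section 3.3, paragraphs 1-2</passage>"
        + "<passage>equations 9-11</passage></rel-passage>";
    System.out.println(readPassages(xml));
  }
}
```

Because only one element is held in memory at a time, this approach also works when streaming over a very large document. The resulting Strings and Maps could then be handed to a plain ObjectMapper (for example via valueToTree()) if JsonNode output is needed.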