Format.getRawFormat() changes whitespace

Question

hansenc opened this issue 5 years ago

The docs for Format.getRawFormat() say that it "performs no whitespace changes", but the behavior does not match that statement. Here is an example with JDOM 2.0.6. I am unsure whether this is working as designed or whether these are bugs, so please let me know.

JDOM code:

try (FileReader in = new FileReader("whitespace.xml");
     FileWriter out = new FileWriter("out.xml")) {
    Document document = new SAXBuilder().build(in);
    new XMLOutputter(Format.getRawFormat()).output(document, out);
}

cat -t whitespace.xml (Tabs are displayed as ^I)

<?xml version="1.0" encoding="UTF-8"?>
<root
    xmlns="http://example.com"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.com http://example.com/example.xsd">

  <line-breaks

      attribute="has
line
breaks

">

  Content
  has
  line
  breaks
  </line-breaks>

  <spaces     attribute   =   " has spaces "  >   Content has spaces  </spaces>
  <spaces        />
  <no-spaces attribute="noSpaces"><no-spaces>ContentHasNoSpaces</no-spaces></no-spaces>
  <no-content attribute="no content"/>
  <tabs^Iattribute="^Ihas^Itabs^I">^IContent^Ihas^Itabs^I</tabs>

</root>

(Note: I converted the line endings of the output file to unix to make the following diff easier to read.)
diff whitespace.xml out-unix.xml | cat -t (Tabs are displayed as ^I)

2,5c2
< <root
<     xmlns="http://example.com"
<     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<     xsi:schemaLocation="http://example.com http://example.com/example.xsd">
---
> <root xmlns="http://example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://example.com http://example.com/example.xsd">
7,13c4
<   <line-breaks
< 
<       attribute="has
< line
< breaks
< 
< ">
---
>   <line-breaks attribute="has line breaks  ">
21,22c12,13
<   <spaces     attribute   =   " has spaces "  >   Content has spaces  </spaces>
<   <spaces        />
---
>   <spaces attribute=" has spaces ">   Content has spaces  </spaces>
>   <spaces />
24,25c15,16
<   <no-content attribute="no content"/>
<   <tabs^Iattribute="^Ihas^Itabs^I">^IContent^Ihas^Itabs^I</tabs>
---
>   <no-content attribute="no content" />
>   <tabs attribute=" has tabs ">^IContent^Ihas^Itabs^I</tabs>

Summary of whitespace changes:

Line endings are changed from UNIX \n to DOS \r\n. This can be worked around by detecting the line endings of the original file and setting them on the Format.
Whitespace inside tags (around the element name, attribute names, equals signs, and closing bracket) is normalized: runs of whitespace between names are replaced with a single space, and whitespace around the equals sign is dropped (e.g. <spaces attribute = "..." > and <spaces /> and <tabs^Iattribute=...). I don't know of a workaround for this, but I would like to know if there is one.

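Indentation and whitespace within element content are preserved (and comments, from what I have seen, though I have not tested CDATA). Spaces within attribute values are preserved too, but as the diff above shows, tabs and line breaks inside attribute values come out as spaces.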
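The line-ending workaround mentioned above could look roughly like the following. This is only a sketch under my own assumptions: the class name LineSeparatorDetector and its detect method are hypothetical helpers, not part of JDOM, and the detection has to run on the raw file text rather than on the parsed Document. It uses only the JDK so it can be run without JDOM on the classpath.

```java
/**
 * Hypothetical helper (not part of JDOM): finds the first line terminator
 * used in a piece of raw text so it can be reused when writing output.
 */
final class LineSeparatorDetector {

    /** Returns "\r\n", "\n" or "\r", whichever occurs first, or the fallback if the text has no line break. */
    static String detect(String text, String fallback) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                // A CR directly followed by LF is a DOS line ending; a lone CR is the old Mac style.
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String raw = "<?xml version=\"1.0\"?>\n<root/>\n";
        String separator = detect(raw, System.lineSeparator());
        // Print the separator with visible escapes.
        System.out.println(separator.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The detected value could then be passed along the lines of Format.getRawFormat().setLineSeparator(separator) before creating the XMLOutputter.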

Jason Hunter · Answer 1 · Thu Dec 05 2019 12:57:50 GMT+0800 (China Standard Time)

The output action didn't change the whitespace. The input action did. :)

And it's not just a JDOM thing. SAX doesn't preserve all the whitespace you're expecting. SAX is allowed to remove unimportant whitespace, and it does. SAXBuilder thus never gets that whitespace reported to it.

The docs include a note about this: http://www.jdom.org/docs/apidocs/org/jdom2/input/SAXBuilder.html
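To see what the parser actually reports, here is a small sketch that takes JDOM out of the picture and uses only the JDK's SAX parser. The class name SaxWhitespaceDemo and its parse method are made up for illustration. It records the first attribute value and the character content that the handler receives for the tabs element from the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: shows what a SAX handler is told about whitespace. */
final class SaxWhitespaceDemo {

    /** Returns { first attribute value of each element, all character content } as reported by SAX. */
    static String[] parse(String xml) {
        StringBuilder attribute = new StringBuilder();
        StringBuilder content = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Whitespace between the element name, attribute name, '=' and '>' is never reported here.
                if (atts.getLength() > 0) {
                    attribute.append(atts.getValue(0));
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                content.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return new String[] { attribute.toString(), content.toString() };
    }

    public static void main(String[] args) {
        String[] seen = parse("<tabs\tattribute  =  \"\thas\ttabs\t\"  >\tContent\thas\ttabs\t</tabs>");
        // Show tabs as ^I, like cat -t does.
        System.out.println("attribute=[" + seen[0].replace("\t", "^I") + "]");
        System.out.println("content=[" + seen[1].replace("\t", "^I") + "]");
    }
}
```

The tabs inside the attribute value arrive as spaces (attribute-value normalization in the XML spec), while the tabs in the element content arrive untouched, which matches the diff in the question.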

Chris Hansen · Answer 2 · Sat Dec 07 2019 01:34:16 GMT+0800 (China Standard Time)

Thanks for the prompt reply @hunterhacker. That definitely explains most of the whitespace changes, but not the line endings. Setting a breakpoint, I can see many \n line endings in the in-memory representation of the Document. However, Format.getRawFormat() still outputs \r\n per the default LineSeparator (with no system property set). There are a couple of ways I could see resolving this:

Update the JavaDoc for Format.getRawFormat() to call out this behavior. The current documentation does not mention that the line separator may be changed and instead says "no whitespace changes", which was misleading to me.
Update the JavaDoc and add an overloaded version, Format.getRawFormat(Document), that auto-detects the line separator for the Document, like
return Format.getRawFormat().setLineSeparator(detectLineSeparator(document));

Let me know if a change would be worthwhile and I can work on a PR soon.

Chris Hansen · Answer 3 · Sat Dec 07 2019 03:11:01 GMT+0800 (China Standard Time)

Oops, it looks like I was wrong again: the line endings are changed during reading. Indeed, parsing a file with DOS line endings still shows \n in memory. I don't think any change is warranted here. Sorry for the noise.
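For anyone who wants to check this without JDOM, a minimal sketch using only the JDK's DOM parser follows. The class name LineEndingNormalizationDemo and its rootText method are hypothetical. As far as I understand it, the XML spec has parsers translate \r\n and lone \r to \n before the document is handed to the application, so the original line endings are not recoverable from the parsed tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

/** Illustrative only: shows that DOS line endings are gone after parsing. */
final class LineEndingNormalizationDemo {

    /** Parses the XML and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Input mixes a DOS line ending (\r\n) and a lone carriage return (\r).
        String text = rootText("<a>one\r\ntwo\rthree</a>");
        System.out.println("contains \\r after parsing: " + text.contains("\r"));
        System.out.println(text.replace("\n", "\\n"));
    }
}
```

So if the original line endings matter, they have to be detected from the raw file before parsing and set on the Format explicitly.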