sxd-document cannot parse document containing a UTF-8 BOM

Question

therealprof opened this issue 7 years ago

Disclaimer: I've been working with XML and UTF-8 for a long time and this is the first time I ran into such a problem so I had to do a bit of research to figure out what's going on...

What I'm trying to do is take a fairly naive approach to writing an application that reads an XML document. The problem is also reproducible with sxd-xpath/evaluate, so I'll use that for easier access; to demonstrate the problem I'll use https://www.broadband-forum.org/cwmp/tr-069-biblio.xml.

This file starts with a UTF-8 BOM, which read_to_string() gladly carries over into the resulting String. That makes the parser fail, because it expects the document to begin literally with <?xml:

# cargo run -- --xpath / tr-069-biblio.xml
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/evaluate --xpath / tr-069-biblio.xml`
Unable to parse input XML
 -> Expected("<?xml")
 -> ExpectedElement
 -> ExpectedWhitespace
 -> ExpectedComment
 -> ExpectedProcessingInstruction
thread 'main' panicked at 'At:
<?xml version=', src/main.rs:52

I'm not sure what the expected behaviour is supposed to be, but I do see a few approaches to address this particular problem:

Have std automatically strip irrelevant magic from file content turned into Strings
Have std provide a normalising read function
Have each application (including sxd-document) specifically deal with this variant
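The third approach can be sketched at the byte level, before the data ever becomes a String. This is only a minimal illustration, not something std or sxd-document provides: `decode_without_bom` and `UTF8_BOM` are hypothetical names, and the sketch assumes the input is UTF-8 with at most one leading BOM.

```rust
/// The UTF-8 encoding of U+FEFF, commonly called the "UTF-8 BOM".
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper (not part of std or sxd-document): decodes bytes as
/// UTF-8 after dropping a single leading BOM, if one is present.
fn decode_without_bom(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // strip_prefix returns None when the slice does not start with the BOM,
    // in which case the input is used unchanged.
    let body = bytes.strip_prefix(UTF8_BOM).unwrap_or(bytes);
    std::str::from_utf8(body)
}

fn main() {
    // A document that starts with a BOM, and the same document without one.
    let with_bom = [0xEF, 0xBB, 0xBF, b'<', b'?', b'x', b'm', b'l'];
    let without_bom = b"<?xml";

    assert_eq!(decode_without_bom(&with_bom), Ok("<?xml"));
    assert_eq!(decode_without_bom(without_bom), Ok("<?xml"));
}
```

In an application, the bytes would typically come from something like `std::fs::read`, and the resulting `&str` would then be handed to the XML parser.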

Jake Goulding · Answer 1 · Sat Feb 18 2017 03:15:52 GMT+0800 (China Standard Time)

Hmm. The UTF-8 BOM should never have existed, and now it's causing problems.

fn main() {
    // UTF-8 BOM (EF BB BF) followed by the ASCII bytes of "hello"
    let b = [0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111];
    let s = std::str::from_utf8(&b).unwrap();
    println!("->{}<-", s);
    println!("{}", s.len());
    for c in s.chars() {
        println!("c: [{}]", c);
    }
}

->hello<-
8
c: []
c: [h]
c: [e]
c: [l]
c: [l]
c: [o]

I'd guess that the most likely solution would be to normalize the text in some fashion. Would you be able to give a crate like unicode-normalization a quick shot to see if it does anything with the BOM?

Daniel Egger · Answer 2 · Tue Feb 21 2017 17:23:33 GMT+0800 (China Standard Time)

> I'd guess that the most likely solution would be to normalize the text in some fashion. Would you be able to give a crate like unicode-normalization a quick shot to see if it does anything with the BOM?

I tried that, and it doesn't change the String at all; the BOM stays in place.

This little hack works though:

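/* If the String starts with a BOM (U+FEFF, bytes EF BB BF in UTF-8), strip it.
   Checking the whole character avoids a panic on empty input and avoids
   removing other characters whose encoding also begins with 0xEF (239). */
if data.starts_with('\u{FEFF}') {
    data.remove(0);
}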
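A non-mutating sketch of the same idea, in case the text is only borrowed: it compares against the whole U+FEFF character and returns a slice, so empty input and other characters whose UTF-8 encoding begins with 0xEF pass through unchanged. `strip_bom` is a hypothetical helper name, not an API of std or sxd-document.

```rust
/// Hypothetical helper: returns the text without one leading U+FEFF, if present.
fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}

fn main() {
    // What read_to_string would produce for a BOM-prefixed file.
    let data = String::from("\u{FEFF}<?xml version=\"1.0\"?><root/>");
    let xml = strip_bom(&data);
    assert!(xml.starts_with("<?xml"));

    // Empty input is returned as-is instead of panicking.
    assert_eq!(strip_bom(""), "");

    // U+FFFD is encoded as EF BF BD; it shares the first byte with the BOM
    // but is a different character, so it is kept.
    assert_eq!(strip_bom("\u{FFFD}x"), "\u{FFFD}x");

    println!("{}", xml);
}
```

The returned slice could then be passed to the parser in place of the original String.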

Andrew McKinlay · Answer 3 · Mon Jan 15 2018 01:14:38 GMT+0800 (China Standard Time)

A BOM should not occur in a string representation of Unicode text in any programming language according to the Unicode spec. The spec says that a BOM is not part of the "Unicode text", and hence should not be present in a programming language implementation of a Unicode string. This makes sense, because the byte order (and encoding) of a string is known implicitly within the programming language (there is no need for in-band signaling).

The standard states a BOM is only valid within the context of a "Unicode encoding scheme," which defines the physical bit representation of a "Unicode encoding form." A BOM is not meant to have meaning within the context of "Unicode text". When a "BOM" is encountered at the abstraction level of Unicode text, it is interpreted as a zero width no-break space (U+FEFF), not a BOM, no matter where it is in the text.

Having a BOM remain in a string changes the meaning of the Unicode text, because now you technically have a zero width no-break space at the beginning of your string that wasn't present in the original encoded form.

What Rust is doing, I don't know. But this has security consequences for string operations like concatenation, and for any text processing libraries that do not expect to encounter a BOM (and they shouldn't have to).
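The concatenation concern can be illustrated with a small sketch. It assumes two strings that were each decoded from BOM-prefixed UTF-8 input with the U+FEFF left in place; `bom_positions` is a hypothetical helper written only for this demonstration.

```rust
/// Hypothetical helper: byte offsets of every U+FEFF found in the text.
fn bom_positions(text: &str) -> Vec<usize> {
    text.match_indices('\u{FEFF}').map(|(i, _)| i).collect()
}

fn main() {
    // Two strings as they would look after decoding BOM-prefixed input.
    let a = "\u{FEFF}ab";
    let b = "\u{FEFF}cd";

    // The retained U+FEFF makes the string differ from its visible text.
    assert_ne!(a, "ab");
    assert_eq!(a.len(), 5); // 3 bytes for U+FEFF plus 2 ASCII bytes

    // After concatenation, the second U+FEFF ends up in the middle.
    let joined = format!("{}{}", a, b);
    assert_eq!(bom_positions(&joined), vec![0, 5]);

    // A search for the visible text fails because of the embedded character.
    assert!(!joined.contains("abcd"));
}
```

Any code that compares, searches, or joins such strings sees the extra character even though it is invisible when printed.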

But then again, that's just, like, my opinion, man. The Unicode standard section on conformance is enlightening. Sorry I got side-tracked; this really isn't an sxd-document issue. This is a defect in Rust if Unicode strings really include a BOM. Even if Rust strings are considered to take on a Unicode encoding form, like UTF-8, they should not carry a BOM. The use of a Unicode encoding scheme is really meant to be reserved for encoding Unicode text in a file or within a network protocol, when it's defined as bits without any higher-level abstraction.

If this is incoherent, just read the conformance clauses in the Unicode spec; they are much clearer. BTW, UTF-8 names both an encoding form and an encoding scheme, the former being a sequence of code unit values, and the latter being the byte encoding of those values along with a possible BOM (the encoding form of UTF-8 is trivially equivalent to the encoding scheme without the BOM, since UTF-8 is obviously insensitive to byte order). I'm gonna puke.

Jake Goulding · Answer 4 · Sun Jan 21 2018 23:51:16 GMT+0800 (China Standard Time)

@amckinlay thanks for the illuminating response! :-)

It rather sounds like you should open an issue on the Rust repo. I don't know if such a change would be acceptable or not, given Rust's stability guarantees, but it seems like it's worth a shot!