Wrong result when parsing escaped unicode characters
m1dnight opened this issue · comments
I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.
Input
If a cell in an Excel sheet contains a newline (carriage return), that character cannot be stored as-is in the XML, so Excel stores it as _x000D_.
A literal underscore that would otherwise start such an escape sequence is encoded as _x005F_.
This means that the literal text _x000D_ is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.
I do have an Excel file (that I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my Excel version, it works fine.
When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.
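For reference, here is a minimal Python sketch of the round trip I would expect. It assumes the rule is that each _xHHHH_ sequence stands for the character with that hexadecimal code point and that _x005F_ stands for a literal underscore. The names decode_xstring and encode_xstring are hypothetical and not part of any library, and the encoder only handles carriage returns to keep the example short.

```python
import re

# One escape sequence: underscore, 'x', exactly four hex digits, underscore.
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

# An underscore that would otherwise be read as the start of an escape sequence.
_AMBIGUOUS_UNDERSCORE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def decode_xstring(raw: str) -> str:
    """Replace each _xHHHH_ escape with the character it encodes.

    re.sub scans left to right and never rescans its own output, so the
    underscore produced by _x005F_ cannot start a second escape.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


def encode_xstring(text: str) -> str:
    """Inverse of decode_xstring, limited to carriage returns (assumption)."""
    # Protect literal text that looks like an escape before adding real escapes.
    protected = _AMBIGUOUS_UNDERSCORE.sub("_x005F_", text)
    return protected.replace("\r", "_x000D_")


if __name__ == "__main__":
    print(repr(decode_xstring("a_x000D_b")))      # escaped carriage return
    print(repr(decode_xstring("_x005F_x000D_")))  # escaped literal text
    print(repr(encode_xstring("_x000D_")))
```

Under this reading, _x000D_ in the file decodes to a carriage return, while _x005F_x000D_ decodes to the seven-character literal text _x000D_.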
Guess
I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped with _x005F_, which results in the entire string being represented as _x005F_x000D_.
This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, so that a CR ends up represented as _x005F_x000D_.
Proof
I have a test case in commit m1dnight@c335061 that shows the behavior.
I'm not sure, though, whether this is a bug in SAX.
Any ideas on how to proceed?