jsonkenl / xlsxir

Xlsx parser for the Elixir language.

Wrong result when parsing escaped unicode characters

m1dnight opened this issue

I tried parsing some Excel files whose cells contain newlines and ran into errors while parsing them.

Input

If a cell in an Excel sheet contains a newline, that Unicode character is not allowed in the standard, so Excel stores it as _x000D_.
An underscore is also not allowed, and that one is encoded as _x005F_.
This means that a carriage return is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.

I do have an Excel file (which I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my version of Excel, it is parsed fine.

When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.
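
To make the expected result concrete, here is a minimal sketch of decoding these escapes. It is written in Python purely as an illustration and is not xlsxir code. It assumes the convention described above: _xHHHH_ stands for the character with four-digit hex code HHHH, and _x005F_ stands for a literal underscore. The function name `unescape_cell_text` is hypothetical.

```python
import re

# One escape sequence: underscore, 'x', four hex digits, underscore.
# Assumption: this is the only escape shape that needs handling.
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_cell_text(raw: str) -> str:
    """Replace every _xHHHH_ sequence with the character it encodes.

    re.sub scans left to right and never rescans text it has already
    replaced, so the underscore produced from _x005F_ cannot combine
    with the following 'x000D_' to form a second escape.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


if __name__ == "__main__":
    # Stored form of a cell holding the literal text _x000D_
    print(repr(unescape_cell_text("_x005F_x000D_")))
    # Stored form of a cell holding a carriage return between two letters
    print(repr(unescape_cell_text("a_x000D_b")))
```

Under these assumptions the stored text _x005F_x000D_ decodes back to the literal string _x000D_, which is what I would expect the parser to return instead of the escaped form.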

Guess

I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped, as _x005F_, which results in the entire string being represented as _x005F_x000D_.

This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, leading a CR to be represented as _x005F_x000D_.

Proof

I have a test case in this commit (m1dnight@c335061) that shows the behavior.

I'm not sure, though, whether this is a bug in SAX or not.

Any ideas on how to proceed?