Shoobx / xmldiff

A library and command line utility for diffing xml

xmldiff not working with utf-8 BOM files

toyg opened this issue

UTF-8 files can come prepended with a Byte-Order Mark (BOM). In UTF-16 and UTF-32 the BOM specifies endianness; in UTF-8 it only acts as a signature, the three bytes EF BB BF. This is fairly common on Windows.

xmldiff.exe currently dies when a BOM is present: the BOM is read as the first character instead of the expected <, and lxml's etree parser bombs out with the traceback below.

In order to read these files correctly in Python, the encoding must be specified as utf-8-sig. This is backward-compatible (i.e. it will behave exactly the same as utf-8 if the BOM is not present).
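A minimal sketch of the difference between the two codecs, using only the standard library. The sample document and the variable names are made up for illustration; nothing here is taken from xmldiff itself.

```python
import codecs

# A tiny XML document saved the way many Windows editors do: UTF-8 with a BOM.
xml_text = '<?xml version="1.0" encoding="utf-8"?><root/>'
with_bom = codecs.BOM_UTF8 + xml_text.encode("utf-8")
without_bom = xml_text.encode("utf-8")

# Plain "utf-8" keeps the BOM as U+FEFF, so the first character is not "<".
as_utf8 = with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present...
as_sig = with_bom.decode("utf-8-sig")

# ...and behaves like "utf-8" when there is no BOM.
plain_sig = without_bom.decode("utf-8-sig")

print(repr(as_utf8[0]), repr(as_sig[0]), repr(plain_sig[0]))
```

The same codec name can be passed as the `encoding` argument to `open()` when reading a file in text mode.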

As a test case, you can use the following two files. Please let me know if you can't get them for any reason:

Traceback (most recent call last):
  File "C:\Users\glacava\.pyenv\pyenv-win\versions\3.9.1\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\glacava\.pyenv\pyenv-win\versions\3.9.1\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Utils\pycharm_prj\appdiff\venv391\Scripts\xmldiff.exe\__main__.py", line 7, in <module>
  File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 116, in diff_command
    result = diff_files(args.file1, args.file2, diff_options=diff_options,
  File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 50, in diff_files
    return _diff(etree.parse, left, right,
  File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 36, in _diff
    left_tree = parse_method(left, parser)
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1880, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1900, in lxml.etree._parseFilelikeDocument
  File "src\lxml\parser.pxi", line 1795, in lxml.etree._parseDocFromFilelike
  File "src\lxml\parser.pxi", line 1201, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "C:\Utils\pycharm_prj\appdiff\tests\test_data\both_1.xml", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I can't reproduce this problem. It works for me. What version of lxml are you using?

... I sure hope this isn't a Windows-only problem, because that will be tricky to fix.

lxml==4.6.2
six==1.15.0

Python is 3.9.1, but I saw it on 3.8 too.

I suspect the issue is that argparse defaults to a better encoding on non-Windows.
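To illustrate that suspicion: a file opened in text mode without an explicit encoding is decoded with the platform's preferred encoding, which on Windows is often cp1252. The sketch below shows what the BOM turns into under cp1252, then sets up a hypothetical argparse parser that pins the encoding to utf-8-sig. The argument name `file1` is borrowed from the traceback; the rest is an assumption for illustration and not necessarily how xmldiff was fixed.

```python
import argparse
import codecs
import os
import tempfile

# Decoded as cp1252, the three BOM bytes become three junk characters
# in front of the "<" that the XML parser expects.
junk = codecs.BOM_UTF8.decode("cp1252")

# Hypothetical parser: pin the encoding instead of relying on the
# platform default. FileType accepts an `encoding` argument.
parser = argparse.ArgumentParser()
parser.add_argument("file1", type=argparse.FileType("r", encoding="utf-8-sig"))

# Write a BOM-prefixed XML file to try it on.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + b"<root/>")

# argparse opens the file with the pinned encoding, so the BOM is dropped.
args = parser.parse_args([path])
with args.file1 as stream:
    content = stream.read()
os.remove(path)

print(repr(junk), repr(content))
```

An alternative under the same assumptions is to hand the XML parser a path or a binary stream, so that it detects the encoding itself rather than receiving text that has already been decoded.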

OK, merged.