xmldiff not working with utf-8 BOM files
toyg opened this issue · comments
UTF-8 files can come prepended with a Byte-Order Mark (BOM). In UTF-16 and UTF-32 the BOM specifies endianness; in UTF-8 it only serves as a signature. This is fairly common on Windows.
XmlDiff.exe currently dies when a BOM is present, because it's read as the first character instead of the expected <, and ElementTree bombs out with the traceback at the bottom.
In order to read these files correctly in Python, the encoding must be specified as utf-8-sig. This is backward-compatible (i.e. it will behave exactly the same as utf-8 if the BOM is not present).
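A minimal sketch of the difference between the two codecs, using only the standard library. The sample XML string and the variable names are made up for illustration; they are not taken from xmldiff.

```python
import codecs

# Hypothetical sample document, encoded with and without a leading BOM.
xml_body = "<root><item>value</item></root>"
with_bom = codecs.BOM_UTF8 + xml_body.encode("utf-8")
without_bom = xml_body.encode("utf-8")

# Plain utf-8 keeps the BOM as a leading U+FEFF character,
# so the text no longer starts with "<".
decoded_plain = with_bom.decode("utf-8")

# utf-8-sig strips a leading BOM if there is one...
decoded_sig = with_bom.decode("utf-8-sig")

# ...and behaves like utf-8 when there is none.
decoded_sig_no_bom = without_bom.decode("utf-8-sig")

print(repr(decoded_plain[0]))
print(decoded_sig == xml_body, decoded_sig_no_bom == xml_body)
```

The leading U+FEFF left behind by plain utf-8 is the character that ends up in front of the expected < in the report above.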
As a test case, you can use the following two files. Please let me know if you can't get them for any reason.
Traceback (most recent call last):
File "C:\Users\glacava\.pyenv\pyenv-win\versions\3.9.1\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\glacava\.pyenv\pyenv-win\versions\3.9.1\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Utils\pycharm_prj\appdiff\venv391\Scripts\xmldiff.exe\__main__.py", line 7, in <module>
File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 116, in diff_command
result = diff_files(args.file1, args.file2, diff_options=diff_options,
File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 50, in diff_files
return _diff(etree.parse, left, right,
File "c:\utils\pycharm_prj\appdiff\venv391\lib\site-packages\xmldiff\main.py", line 36, in _diff
left_tree = parse_method(left, parser)
File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1880, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1900, in lxml.etree._parseFilelikeDocument
File "src\lxml\parser.pxi", line 1795, in lxml.etree._parseDocFromFilelike
File "src\lxml\parser.pxi", line 1201, in lxml.etree._BaseParser._parseDocFromFilelike
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "C:\Utils\pycharm_prj\appdiff\tests\test_data\both_1.xml", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I can't reproduce this problem. It works for me. What version of lxml are you using?
... I sure hope this isn't a Windows-only problem, because that will be tricky to fix.
lxml==4.6.2
six==1.15.0
Python is 3.9.1, but I saw it on 3.8 too.
I suspect the issue is that argparse defaults to a better encoding on non-Windows.
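If that suspicion is right, the file is opened in text mode with the platform's default encoding before the parser sees it. The sketch below illustrates the effect under that assumption. It uses the standard library's xml.etree.ElementTree rather than lxml, writes a throwaway temp file, and picks cp1252 as a stand-in for a typical Windows default; none of this is xmldiff's actual code.

```python
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a small UTF-8 file that starts with a BOM (hypothetical test data).
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + b"<root><child/></root>")

# Decoded as cp1252, the three BOM bytes turn into three junk
# characters that sit in front of the first "<".
with open(path, encoding="cp1252") as handle:
    mangled_prefix = handle.read(3)

# Opened as utf-8-sig, the BOM is dropped and the parser sees "<" first.
with open(path, encoding="utf-8-sig") as handle:
    root = ET.parse(handle).getroot()

os.remove(path)
print(repr(mangled_prefix), root.tag)
```

Opening the file in binary mode and letting the XML parser detect the encoding itself would be another option; the utf-8-sig variant is shown here because it is the one proposed in the report.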
OK, merged.