JoshData / pdf-diff

A PDF comparison utility in Python.

lxml error

pimpampoum opened this issue

Hello,

I got this error when running your pdf-diff.py. Any idea what causes it?

$ ../VISA_III/pdf-diff.py ../VISA_III/visa_iii.pdf ../VISA_III/visa_iii_old.pdf > diff_visa_iii_iv.png
Traceback (most recent call last):
  File "../VISA_III/pdf-diff.py", line 456, in <module>
    changes = compute_changes(left_file, right_file, top_margin=top_margin)
  File "../VISA_III/pdf-diff.py", line 9, in compute_changes
    docs = [serialize_pdf(0, pdf_fn_1, top_margin), serialize_pdf(1, pdf_fn_2, top_margin)]
  File "../VISA_III/pdf-diff.py", line 24, in serialize_pdf
    for run in box_generator:
  File "../VISA_III/pdf-diff.py", line 84, in mark_eol_hyphens
    for next_box in boxes:
  File "../VISA_III/pdf-diff.py", line 57, in pdf_to_bboxes
    dom = etree.fromstring(xml)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:123074)
  File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:117114)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 7, line 10432, column 81

This looks like a character encoding issue: either pdftotext and subprocess.check_output aren't using the same encoding, or etree.fromstring isn't quite the right way to load the XML.

Thanks.
Well, I'm afraid you're right. The document contains plenty of math formulas that pdftotext can't deal with.

Ideally this module wouldn't crash in those cases, so something can probably be fixed (although I don't have time to try it myself).
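For anyone hitting the same traceback, here is a minimal sketch of one possible workaround. It is not the fix from any PR on this repo. It assumes the pdftotext output has already been decoded to a str, and it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra dependencies. The helper names strip_invalid_xml_chars and parse_xml_tolerantly are hypothetical. The idea is that "Char value 7" is the BEL control character, which XML 1.0 does not allow, so the sketch removes such characters before parsing:

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow in a document:
# C0 control characters other than tab (0x09), LF (0x0A) and CR (0x0D),
# plus the non-characters U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_xml_tolerantly(xml_text):
    """Parse XML text after dropping characters the parser would reject."""
    return ET.fromstring(strip_invalid_xml_chars(xml_text))


# Hypothetical input: a word containing a stray BEL (0x07) character.
sample = "<doc><word>x\x07y</word></doc>"

try:
    ET.fromstring(sample)
except ET.ParseError as exc:
    print("unsanitized input fails:", exc)

root = parse_xml_tolerantly(sample)
print(root.find("word").text)  # prints "xy"
```

The trade-off is that the offending characters are silently dropped, so text that pdftotext could not extract properly (such as the math formulas mentioned above) would simply be missing from the comparison rather than crashing it.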

FYI, I ran into a similar issue during the 2017 Mozilla Global Sprint, where I used this library, and I have a potential patch/PR to fix it. Do you have any interest in that?

See PR #12 above, which should resolve this.