"PDFDocument is not initialized" when startxref is invalid
umaplehurst opened this issue
Bug report
We came across a corrupted .pdf where the startxref pointer is invalid and points to an offset before the actual xref table. The following backtrace is then observed during loading:
Traceback (most recent call last):
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
  File "lib\site-packages\pdfminer\high_level.py", line 197, in extract_pages
    for page in PDFPage.get_pages(
  File "lib\site-packages\pdfminer\pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 280, in load
    (_, stream) = parser.nextobject()
  File "lib\site-packages\pdfminer\psparser.py", line 654, in nextobject
    self.do_keyword(pos, token)
  File "lib\site-packages\pdfminer\pdfparser.py", line 92, in do_keyword
    objlen = int_value(dic["Length"])
  File "lib\site-packages\pdfminer\pdftypes.py", line 151, in int_value
    x = resolve1(x)
  File "lib\site-packages\pdfminer\pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "lib\site-packages\pdfminer\pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 851, in getobj
    raise PDFException("PDFDocument is not initialized")
pdfminer.pdftypes.PDFException: PDFDocument is not initialized
How to reproduce
Take a .pdf with an existing startxref and set its offset to 0 in the file instead of the real xref table offset. I modified zen_of_python_corrupted.pdf to reproduce this bug; the file is attached: zen_of_python_corrupted_xref.pdf
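For anyone without the attachment, the corruption step could probably be scripted along these lines. This is only a rough sketch: `zero_startxref` is a hypothetical helper name (not part of pdfminer), and it assumes the file has a plain `startxref <offset>` trailer entry.

```python
import re


def zero_startxref(data: bytes) -> bytes:
    """Return a copy of the PDF bytes with the last startxref offset set to 0.

    Hypothetical helper for reproducing the issue; not a pdfminer API.
    """
    # The file normally ends with: startxref <EOL> <offset> <EOL> %%EOF
    matches = list(re.finditer(rb"startxref\s+(\d+)", data))
    if not matches:
        raise ValueError("no startxref entry found")
    # Only the final startxref is used when a reader opens the file.
    start, end = matches[-1].span(1)
    return data[:start] + b"0" + data[end:]
```

Reading a valid .pdf in binary mode, passing the bytes through this function, and writing the result to a new file should give a file with the same kind of invalid pointer. Since the edit happens after the xref table, the other offsets in the file are not shifted.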
Thoughts
Adobe Reader is able to open the corrupted file and repair it. I guess one way to work around this issue would be to look for an "xref" string after seeking to the pointer, instead of trusting the offset value blindly.
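To make the idea concrete, here is a minimal sketch of such a check on raw bytes, outside of pdfminer. The name `locate_xref` is hypothetical, and the sketch assumes a classic cross-reference table introduced by the `xref` keyword; files that use cross-reference streams would need a different test.

```python
import re

# Matches the "xref" keyword, but not the "xref" inside "startxref".
XREF_KEYWORD = re.compile(rb"(?<!start)xref\s")


def locate_xref(data: bytes, claimed_offset: int) -> int:
    """Return claimed_offset if it points at "xref", else search for the table.

    Hypothetical helper; assumes a classic xref table, not an xref stream.
    """
    if 0 <= claimed_offset < len(data):
        pos = claimed_offset
        # Tolerate whitespace between the claimed offset and the keyword.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if data.startswith(b"xref", pos):
            return pos
    # The pointer cannot be trusted: fall back to the last "xref" keyword,
    # which is the most recent table in an incrementally updated file.
    matches = list(XREF_KEYWORD.finditer(data))
    if not matches:
        raise ValueError("no xref table found")
    return matches[-1].start()
```

With the corrupted file, `locate_xref(data, 0)` would skip the bogus offset and return the position of the real table, which could then be handed to the normal xref loading code.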