"PDFDocument is not initialized" when startxref is invalid

Question

umaplehurst opened this issue 6 months ago · comments

Ursula Maplehurst commented 6 months ago

Bug report

We came across a corrupted PDF whose startxref pointer is invalid: it points to an offset before the actual xref table. Loading the file then fails with the following traceback:

Traceback (most recent call last):
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
  File "lib\site-packages\pdfminer\high_level.py", line 197, in extract_pages
    for page in PDFPage.get_pages(
  File "lib\site-packages\pdfminer\pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 280, in load
    (_, stream) = parser.nextobject()
  File "lib\site-packages\pdfminer\psparser.py", line 654, in nextobject
    self.do_keyword(pos, token)
  File "lib\site-packages\pdfminer\pdfparser.py", line 92, in do_keyword
    objlen = int_value(dic["Length"])
  File "lib\site-packages\pdfminer\pdftypes.py", line 151, in int_value
    x = resolve1(x)
  File "lib\site-packages\pdfminer\pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "lib\site-packages\pdfminer\pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 851, in getobj
    raise PDFException("PDFDocument is not initialized")
pdfminer.pdftypes.PDFException: PDFDocument is not initialized

How to reproduce

Take a PDF with an existing startxref entry and set its offset to 0 instead of the real xref table offset. I modified zen_of_python_corrupted.pdf this way to reproduce the bug. The resulting file is attached: zen_of_python_corrupted_xref.pdf
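
For anyone who wants to corrupt a file of their own the same way, here is a minimal sketch of that step. The helper name zero_startxref is made up for illustration and is not part of pdfminer. It assumes the file contains at least one plain-text startxref entry followed by a decimal offset, and it only touches the last one.

```python
import re


def zero_startxref(data: bytes) -> bytes:
    """Return a copy of `data` with the last startxref offset replaced by 0.

    Hypothetical helper for reproducing the bug; not a pdfminer API.
    """
    # Collect every "startxref <digits>" entry; an incrementally updated
    # file can contain several, and readers use the last one.
    matches = list(re.finditer(rb"startxref\s+(\d+)", data))
    if not matches:
        raise ValueError("no startxref entry found")
    last = matches[-1]
    # Splice "0" in place of the original digits. Everything after this
    # point is the end-of-file marker, so no other offsets are affected.
    return data[: last.start(1)] + b"0" + data[last.end(1):]


# Example use (paths are placeholders):
#   with open("input.pdf", "rb") as f:
#       corrupted = zero_startxref(f.read())
#   with open("input_corrupted_xref.pdf", "wb") as f:
#       f.write(corrupted)
```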

Thoughts

Adobe Reader is able to open the corrupted file and repair it. I guess one way to work around this issue would be to check for an "xref" keyword after seeking to the offset, rather than trusting the startxref value blindly.
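
A rough sketch of that check, to make the idea concrete. This is not how pdfminer is implemented, and find_xref_offset is a hypothetical name. It assumes the whole file is available as bytes. If the claimed offset does not point at an "xref" keyword (or at an "N G obj" header, which is how a cross-reference stream would begin), it falls back to the last standalone "xref" keyword in the file. A file that uses only cross-reference streams has no such keyword, so the fallback would not help there.

```python
import re

# "N G obj" header; a cross-reference stream is an ordinary indirect object.
_OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj")
# Standalone "xref" keyword, excluding the tail of "startxref".
_XREF_KEYWORD = re.compile(rb"(?<!start)xref")


def find_xref_offset(data: bytes, claimed: int) -> int:
    """Return a plausible xref offset, preferring the claimed one.

    Hypothetical helper illustrating the workaround; not a pdfminer API.
    """
    in_range = 0 <= claimed < len(data)
    # Accept the claimed offset if a classic xref table starts there.
    if in_range and data.startswith(b"xref", claimed):
        return claimed
    # Also accept it if an indirect object starts there, since that may
    # be a cross-reference stream (this sketch does not verify its type).
    if in_range and _OBJ_HEADER.match(data, claimed):
        return claimed
    # Otherwise do not trust the pointer: use the last "xref" keyword.
    candidates = [m.start() for m in _XREF_KEYWORD.finditer(data)]
    if not candidates:
        raise ValueError("no xref table found")
    return candidates[-1]
```

Picking the last keyword mirrors the fact that the most recent xref section sits closest to the end of the file. A real fix would still need to validate what it finds at that position before parsing it.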