Crash when PDF contains empty pages

Question

YasminaFr opened this issue 2 years ago

Hello,

I have a problem with the pdftotext library that I cannot work around.
When I extract text from a PDF that contains empty pages and try to print the text of an empty page, the code crashes.
For example, take a PDF with 3 pages where only the first one contains text:

import pdftotext

with open(file, "rb") as f:
    pdf = pdftotext.PDF(f)

If I do print(pdf[1]) the code crashes without an explicit error.

Why doesn't it return an empty string, for example?

Thank you for your help.

Jason Alan Palmer · Answer 1 · Fri Aug 12 2022 23:42:58 GMT+0800 (China Standard Time)

Please attach the PDF that is causing the error and also include the exact error message that you're getting, if any.

YasminaFr · Answer 2 · Wed Aug 17 2022 16:05:31 GMT+0800 (China Standard Time)

Hi, sorry, I cannot attach the PDF because it is confidential, and when I try to anonymize it, the behavior is different.

When I try to print the content of an empty page, I get the following error and then the code crashes:

poppler/error: Failed to parse XRef entry [19].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error (368953): Illegal character '>'
poppler/error: Failed to parse XRef entry [22].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error: Failed to parse XRef entry [29].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error: Failed to parse XRef entry [32].
poppler/error: Kid object (page 2) is wrong type (null)
Segmentation fault

The PDF may be broken, but I would like to catch this error to prevent the code from crashing, and I have not found a way.

Jason Alan Palmer · Answer 3 · Fri Aug 19 2022 05:57:02 GMT+0800 (China Standard Time)

If you can't provide a PDF that reproduces the issue, how do you propose I help you? I just tried a blank PDF (attached) and it works fine.

blank.pdf
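
Note on catching the crash: a segmentation fault terminates the whole Python process from native code, so a try/except around pdf[1] cannot intercept it. One possible workaround, offered only as a rough sketch and not as something the library provides, is to run the extraction in a separate Python interpreter and check its exit status. The helper names run_child and safe_page_text and the 60-second timeout are hypothetical choices made for this illustration; only pdftotext.PDF(f) and page indexing come from the thread above.

```python
import subprocess
import sys

# Script executed in the child interpreter: it loads the PDF and writes the
# text of a single page to stdout. If poppler segfaults, only the child dies.
_CHILD_SCRIPT = """
import sys
import pdftotext

path, index = sys.argv[1], int(sys.argv[2])
with open(path, "rb") as f:
    pdf = pdftotext.PDF(f)
sys.stdout.buffer.write(pdf[index].encode("utf-8"))
"""


def run_child(script, *args, timeout=60):
    """Run `script` in a separate Python process (hypothetical helper).

    Returns the child's stdout as a string, or None if the child exited
    with a non-zero status, was killed by a signal, or timed out.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    # On POSIX a negative return code means the child was killed by a signal
    # (for example -11 for SIGSEGV); any non-zero value is treated as failure.
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")


def safe_page_text(path, index):
    """Return the text of page `index`, or None if extraction crashed."""
    return run_child(_CHILD_SCRIPT, path, str(index))
```

With this sketch, safe_page_text(file, 1) would return None for a page that crashes the extractor, and the caller can substitute an empty string if that is the desired behavior. Starting one interpreter per page is slow, so for large documents it may be preferable to extract all pages in a single child process and fall back to per-page calls only when that fails.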