jalan / pdftotext

Simple PDF text extraction

Crash when PDF contains empty pages

YasminaFr opened this issue

Hello,

I have a problem with the pdftotext library that I cannot work around.
When I extract text from a PDF that contains empty pages and try to print the text of an empty page, the code crashes.
For example, if I have a PDF with 3 pages and only the first one contains text:

import pdftotext

with open(file, "rb") as f:
    pdf = pdftotext.PDF(f)

If I call print(pdf[1]), the code crashes without an explicit error.

Why doesn't it return an empty string, for example?

Thank you for your help.

Please attach the PDF that is causing the error and also include the exact error message that you're getting, if any.

Hi, sorry, I cannot attach the PDF because it is confidential, and when I try to anonymize it, the behavior is different.

When I try to print the content of an empty page, I get the following error and then the code crashes:

poppler/error: Failed to parse XRef entry [19].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error (368953): Illegal character '>'
poppler/error: Failed to parse XRef entry [22].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error: Failed to parse XRef entry [29].
poppler/error: Kid object (page 2) is wrong type (null)
poppler/error: Failed to parse XRef entry [32].
poppler/error: Kid object (page 2) is wrong type (null)
Segmentation fault

The PDF may be broken, but I would like to catch this error so that my code does not crash, and I have not found a way to do so.

If you can't provide a PDF that reproduces the issue, how do you propose I help you? I just tried a blank PDF (attached) and it works fine.

blank.pdf
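
A note on catching the crash described above: a segmentation fault happens in native code and kills the whole Python process, so a try/except around print(pdf[1]) cannot intercept it. One possible workaround, sketched here and not something proposed in this thread, is to run the extraction in a separate Python process and check its exit status. The helper names run_isolated and safe_page_text, the EXTRACT_SCRIPT child script, and the 60-second timeout are all hypothetical choices made for illustration; only pdftotext.PDF(f) and page indexing come from the report above.

```python
import subprocess
import sys

# Hypothetical child script: extracts one page and writes it to stdout as UTF-8.
# argv[1] is the PDF path, argv[2] is the zero-based page index.
EXTRACT_SCRIPT = """
import sys
import pdftotext
with open(sys.argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)
sys.stdout.buffer.write(pdf[int(sys.argv[2])].encode("utf-8"))
"""


def run_isolated(script, *args, timeout=60):
    """Run `script` in a fresh Python process.

    Returns the child's stdout decoded as UTF-8, or None if the child
    exited abnormally (non-zero status, killed by a signal, or timed out).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    # On POSIX a negative return code means the child died from a signal
    # (for example SIGSEGV); any non-zero code is treated as a failure here.
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8")


def safe_page_text(path, page_index):
    """Return the text of one page, or None if extraction crashed."""
    return run_isolated(EXTRACT_SCRIPT, path, str(page_index))
```

With this sketch, safe_page_text("document.pdf", 1) would return None instead of taking down the caller if the page triggers the crash. Starting a new process per page is slow, so for large files it may be preferable to extract all pages in one child and fall back to per-page calls only when that fails.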