jalan / pdftotext

Simple PDF text extraction

poppler/error: Failed to parse XRef entry [11].poppler/error: Top-level pages object is wrong type (null)

juanfrilla opened this issue

I'm receiving the following error for this URL:
poppler/error: Failed to parse XRef entry [11].
poppler/error: Top-level pages object is wrong type (null)
https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf

And with this code:

import io
import requests
import pdftotext
url = "https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf"
content = requests.get(url).content
pdf = pdftotext.PDF(io.BytesIO(content))

I'm using poppler-utils-0.26.5-43.el7.1.x86_64
pdftotext version 0.26.5
on a CentOS server. I don't know if I need to upgrade poppler. Is there anything I can do without upgrading poppler?
Or is there a way of catching this poppler error and skipping the PDF that causes it?

Is there a way of catching this poppler error and skip the PDF that gives that error

Sure, you can include exception handling:

import io
import requests
import pdftotext

url = "https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf"
content = requests.get(url).content
try:
    pdf = pdftotext.PDF(io.BytesIO(content))
except pdftotext.Error as exception:
    # Do whatever you want here
    print(f"I couldn't open that PDF: {exception}")
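
If you are processing many PDFs and just want to skip the broken ones, the same try/except can go inside a loop. This is only a rough sketch: `extract_texts`, `open_pdf`, and `error_type` are hypothetical names made up for this example, not part of pdftotext. The opener and exception type are passed in as arguments, so with pdftotext you would presumably call it as `extract_texts(blobs, pdftotext.PDF, pdftotext.Error)`, where `blobs` maps a name (for example the URL) to the downloaded bytes.

```python
import io


def extract_texts(blobs, open_pdf, error_type):
    """Return (texts, skipped) for a mapping of name -> raw PDF bytes.

    open_pdf and error_type are injected by the caller. With pdftotext
    they would be pdftotext.PDF and pdftotext.Error (assumed usage,
    mirroring the try/except shown above).
    """
    texts = {}
    skipped = {}
    for name, data in blobs.items():
        try:
            pdf = open_pdf(io.BytesIO(data))
        except error_type as exc:
            # Record why this PDF failed and move on to the next one.
            skipped[name] = str(exc)
            continue
        # A pdftotext.PDF object is iterable and yields one string per page.
        texts[name] = "\n\n".join(pdf)
    return texts, skipped
```

Afterwards `skipped` tells you which PDFs could not be opened and why, so you can retry them later, for example on a machine with a newer poppler.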