Unable to convert specific pdf file correctly

Question

OsamaAnwer opened this issue 4 years ago

The Python pdftotext library is unable to convert the following file correctly.
The file has 4 pages, and text extraction returns nothing for the first 3.
sample-file.pdf
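
A minimal sketch for reproducing the symptom by listing which pages come back blank. It assumes the `pdftotext` Python package is installed and that the attachment has been saved locally; the helper names `empty_pages` and `report_empty_pages` are hypothetical, not part of the library.

```python
def empty_pages(pages):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [n for n, text in enumerate(pages, start=1) if not text.strip()]


def report_empty_pages(path):
    # Imported lazily so empty_pages() can be used without the library installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)  # behaves like a sequence of per-page strings
        return empty_pages(pdf)
```

Calling `report_empty_pages("sample-file.pdf")` on the file from this report would presumably list the first three pages.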

However, if I convert the same file with the poppler pdftotext utility, it works fine. This happens only for this specific file.

sample-text-extraction-poppler-pdftotext-utlity-output.txt
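
The comparison with the command-line tool could be scripted roughly as follows. This is a sketch that assumes the poppler `pdftotext` binary is on the PATH; `split_pages` and `cli_pages` are hypothetical helper names. It relies on the CLI ending each page with a form feed character.

```python
import subprocess


def split_pages(cli_output):
    """Split pdftotext CLI output into pages (each page ends with a form feed)."""
    pages = cli_output.split("\f")
    # The form feed after the last page leaves one trailing empty chunk.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def cli_pages(path):
    # "-" sends the extracted text to stdout instead of writing a .txt file.
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return split_pages(result.stdout)
```

Comparing `cli_pages(path)` page by page with the library output shows where the two disagree.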

I also tried to debug the issue by adding some logs, but it didn't help.

Zubair Uddin Farooqui · Answer 1 · Tue Oct 20 2020 18:50:29 GMT+0800 (China Standard Time)

I am facing the same issue. Can anyone please help me out?

Tushar Makkar · Answer 2 · Fri Feb 12 2021 15:53:06 GMT+0800 (China Standard Time)

I am also facing a similar issue. Is it related to the poppler version? My libpoppler version is 0.62.0.

Palaparthi NBB Anirudh · Answer 3 · Sun May 09 2021 20:48:17 GMT+0800 (China Standard Time)

This can be worked around by temporarily saving each page as an image and then processing the images with pdftotext or pytesseract.
I've done it this way before and can implement it. If given the chance, I would like to contribute. I know my way around git and Python, but this would be my first contribution, so please let me know whether I can and how to go about it.
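
One way the image-based workaround might look, as a sketch rather than the commenter's actual code: blank pages are re-extracted with OCR. It assumes the third-party packages `pdf2image` and `pytesseract` (plus the Tesseract binary) are installed; `pdf2image` is not mentioned in this thread and is swapped in here for the "save each page as an image" step. The names `fill_empty_pages` and `make_ocr` are hypothetical.

```python
def fill_empty_pages(pages, ocr_page):
    """Replace blank pages with the result of ocr_page(page_number)."""
    return [
        text if text.strip() else ocr_page(n)
        for n, text in enumerate(pages, start=1)
    ]


def make_ocr(path, dpi=300):
    # Lazy imports: pdf2image and pytesseract are assumed, optional dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    def ocr_page(n):
        # Render only page n to an image, then run Tesseract on it.
        image = convert_from_path(path, dpi=dpi, first_page=n, last_page=n)[0]
        return pytesseract.image_to_string(image)

    return ocr_page
```

Usage would be along the lines of `fill_empty_pages(list(pdf), make_ocr("sample-file.pdf"))`, where `pdf` is the `pdftotext.PDF` object. OCR output is generally noisier than direct text extraction, so it is applied only to pages that came back blank.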

Jason Alan Palmer · Answer 4 · Sun May 16 2021 08:36:03 GMT+0800 (China Standard Time)

Fixed in version 2.1.6, as long as the version of poppler is new enough. Very old versions may still have issues. Thanks for the report, @OsamaAnwer.
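
To check whether an installed copy includes the fix, something like the following could be used. It is a sketch that assumes the package is installed under the distribution name `pdftotext` and uses plain dotted version numbers; the helper names are hypothetical. It does not check the poppler version, which the answer above says also matters.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '2.1.6' into (2, 1, 6); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(installed, minimum="2.1.6"):
    # Tuple comparison orders versions component by component.
    return parse_version(installed) >= parse_version(minimum)


def installed_pdftotext_version():
    # Returns None when the package is not installed.
    try:
        return version("pdftotext")
    except PackageNotFoundError:
        return None
```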