jalan / pdftotext

Simple PDF text extraction

Unable to convert specific pdf file correctly

OsamaAnwer opened this issue

The Python pdftotext package does not convert the following file correctly.
The file has 4 pages, and text extraction returns nothing for the first 3.
sample-file.pdf
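
For anyone trying to reproduce this, here is a minimal sketch that lists which pages come back blank. It assumes the `pdftotext` package is installed and that a `PDF` object can be iterated page by page, as the project README shows. The helper names `empty_page_numbers` and `report_empty_pages` are made up for illustration.

```python
def empty_page_numbers(pages):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [n for n, text in enumerate(pages, start=1) if not text.strip()]


def report_empty_pages(path):
    """Open a PDF with the pdftotext package and list its blank pages."""
    # Imported here so the helper above stays usable without the package.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return empty_page_numbers(pdf)
```

With the file from this report, `report_empty_pages("sample-file.pdf")` would be expected to list the first three pages if the problem reproduces.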

However, if I convert the same file with the poppler pdftotext utility, it works fine. This happens only with this specific file.

sample-text-extraction-poppler-pdftotext-utlity-output.txt
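
The comparison against the command-line utility can be scripted page by page. This is a rough sketch, assuming the poppler `pdftotext` binary is on `PATH`; it uses the `-f` and `-l` options to select a single page and `-` to write to stdout. The function names are hypothetical.

```python
import subprocess


def poppler_page_command(path, page):
    """Build the CLI call that extracts one 1-based page to stdout."""
    return ["pdftotext", "-f", str(page), "-l", str(page), path, "-"]


def poppler_page_text(path, page):
    """Run the poppler utility and return the text of a single page."""
    result = subprocess.run(
        poppler_page_command(path, page),
        capture_output=True,
        text=True,
        check=True,  # raise if the utility reports an error
    )
    return result.stdout
```

Comparing `poppler_page_text(path, n)` with the Python package's output for page `n` shows on which pages the two differ.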

I also tried to debug the issue by adding some logs, but it didn't help.

I am facing the same issue. Can anyone please help me out?

I am also facing a similar issue. Is it related to the poppler version? My libpoppler version is 0.62.0.
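
To include the poppler version in a report, one option is to parse the banner printed by `pdftotext -v`. This is a sketch under two assumptions: the banner contains the word "version" followed by a dotted number, and the command-line utility comes from the same poppler build that the Python package links against, which may not hold on every system. The function names are illustrative.

```python
import re
import subprocess


def parse_poppler_version(banner):
    """Extract (major, minor, patch) from a 'pdftotext -v' banner, or None."""
    match = re.search(r"version (\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_poppler_version():
    """Ask the poppler utility for its version banner and parse it."""
    result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    # The banner normally goes to stderr; stdout is checked as well to be safe.
    return parse_poppler_version(result.stderr + result.stdout)
```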

This can be worked around by temporarily saving each page as an image and then processing the images with pdftotext or pytesseract.
I have done this before and can implement it. If given the chance, I would like to contribute. I know my way around git and Python, but this would be my first contribution, so please let me know whether I can contribute and how.
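
A rough sketch of that workaround, applied only to the pages that come back blank. It is not part of pdftotext. It assumes the third-party `pdf2image` and `pytesseract` packages (and the Tesseract binary) are installed, and the helper names are hypothetical. OCR output is usually less accurate than direct text extraction.

```python
def text_with_ocr_fallback(page_texts, ocr_page):
    """Keep extracted text where present; call ocr_page(index) for blank pages."""
    return [
        text if text.strip() else ocr_page(index)
        for index, text in enumerate(page_texts)
    ]


def ocr_pdf_page(path, index, dpi=300):
    """Render one 0-based page to an image and run OCR on it."""
    # Third-party imports are kept local so the helper above has no dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        path, dpi=dpi, first_page=index + 1, last_page=index + 1
    )
    return pytesseract.image_to_string(images[0])
```

A call such as `text_with_ocr_fallback(list(pdf), lambda i: ocr_pdf_page(path, i))` leaves pages that already have text untouched.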

Fixed in version 2.1.6, as long as the version of poppler is new enough. Very old versions may still have issues. Thanks for the report, @OsamaAnwer.
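
To check whether an environment already has the fixed release, the installed package version can be compared against 2.1.6. This is a minimal sketch assuming the distribution is installed under the name `pdftotext` and uses plain dotted version numbers. `has_fix` and `parse_release` are illustrative names, and the check says nothing about whether the poppler version is new enough.

```python
import re
from importlib.metadata import version

FIXED_IN = (2, 1, 6)  # release named in the maintainer's comment


def parse_release(text):
    """Turn a dotted version string such as '2.1.6' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text)[:3])


def has_fix(installed):
    """Return True if the given pdftotext version is 2.1.6 or later."""
    return parse_release(installed) >= FIXED_IN


def installed_has_fix():
    """Check the pdftotext package installed in the current environment."""
    return has_fix(version("pdftotext"))
```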