jalan / pdftotext

Simple PDF text extraction

pdftotext has unusual failures

cordeliac opened this issue

I actually thought that pdftotext used "measurements" and "font information" to extract text, more or less reading something like: at 2.5" from the top and 1.74" from the left border, print the letters "OTH" in Times Roman 14 pt italic. Of course it's not quite so straightforward, but I didn't think pdftotext was using any sort of OCR, such that it was looking at a small oval and trying to decide whether that's the letter "O" or the number "0". But I have two successive lines in a PDF "report" that uses a monospaced font like Courier. The result varies: sometimes I get extra spaces between each character and really odd alignment issues (using -layout). When I open the document in Acrobat, I don't see anything that would explain these errors. I've attached some representative areas from the TEXT output (black background) and the PDF (white background) where the output seems like it "should" be better. I occasionally have negative signs replaced by a tilde (~). I mean, they can look somewhat alike, but I wasn't expecting an OCR-type result. Can anyone offer an explanation?

Screen Shot 2020-09-17 at 1 01 25 AM

Screen Shot 2020-09-17 at 12 59 13 AM

Screen Shot 2020-09-17 at 12 57 59 AM

Screen Shot 2020-09-17 at 12 57 31 AM

There is no OCR here. If you want help with a PDF, please attach it here or provide a link to it.

My request for a PDF demonstrating the issue has been ignored for two months. Closing.

I apologize, jalan. You may be able to tell from the samples I posted that the PDF in question is hundreds of pages long and contains sensitive wage and salary information, so I cannot provide the document or even an entire page from it. Your answer that there is no OCR provided most of what I needed to know, though. Some document scanning and archiving services run their own OCR on a document and store the recognized text within the PDF file somehow. I noticed that if I select the text in Acrobat and copy and paste it, I get the same result. So although the page renders "correctly", the selected text doesn't match the render. The fault must lie with the process that turned the "report" into a PDF, as the invalid characters are indeed within the document.
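
For anyone who hits the same thing and can't share the file either, here is a rough sketch of how the extracted text could be scanned for the two symptoms described above: a tilde where a minus sign was probably intended, and letter-spaced runs like "T O T A L". The function name `find_suspects` and both regular expressions are my own assumptions, not part of pdftotext, and the patterns are heuristics that may need tuning for a given report layout.

```python
import re

# Heuristic (assumption): a tilde directly before a number, optionally with a
# space or "$" in between, was probably meant to be a minus sign.
TILDE_AS_MINUS = re.compile(r"~(?=\s?\$?\d)")

# Heuristic (assumption): four or more single characters separated by single
# spaces, e.g. "T O T A L", suggest the text layer stored each glyph apart.
LETTER_SPACED = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")


def find_suspects(page_text):
    """Return (line_number, kind, line) tuples for lines that look wrong."""
    hits = []
    for lineno, line in enumerate(page_text.splitlines(), start=1):
        if TILDE_AS_MINUS.search(line):
            hits.append((lineno, "tilde-before-digit", line))
        if LETTER_SPACED.search(line):
            hits.append((lineno, "letter-spaced", line))
    return hits


if __name__ == "__main__":
    # Made-up sample text, not taken from the real report.
    sample = "Gross  1,200.00\nAdjust  ~45.10\nT O T A L  900"
    for lineno, kind, line in find_suspects(sample):
        print(f"line {lineno}: {kind}: {line!r}")
```

With this library, each page of `pdftotext.PDF(f)` is a plain string, so the pages can be passed to `find_suspects` one at a time. This only flags where the text layer is bad; it can't recover the intended characters, since the wrong ones are what the PDF actually stores.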