jalan / pdftotext

Simple PDF text extraction

Is it possible to find text coordinates on the page using pdftotext?

jeanmonet opened this issue

I suspect the answer is no, but wanted to check in case I'm missing something.
If pdftotext / poppler is not able to provide text coordinates on the page, do you know of another reliable tool to do so?

I believe poppler can do that with some of its command-line tools, yes. But it's not part of this Python library. This library is meant to be fast and simple: all it does is extract full pages of text.
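For anyone landing here, a minimal sketch of the command-line route: poppler's `pdftotext` CLI has a `-bbox` option that writes XHTML with one `<word>` element per word, carrying `xMin`, `yMin`, `xMax`, and `yMax` attributes. The helper names below (`Word`, `parse_bbox_xhtml`, `words_with_coordinates`) are made up for illustration and are not part of this library or of poppler. It assumes the `pdftotext` binary from poppler-utils is on your `PATH`, and you should check the coordinate origin against your own documents.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Word(NamedTuple):
    """One word and its bounding box (hypothetical helper type)."""
    page: int      # 1-based page number, counted from <page> elements
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_bbox_xhtml(xhtml):
    """Parse the XHTML written by `pdftotext -bbox` into a list of Word."""
    root = ET.fromstring(xhtml)
    words = []
    page_no = 0
    # iter() walks in document order, so each <page> is seen before its words.
    for el in root.iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace prefix
        if name == "page":
            page_no += 1
        elif name == "word":
            words.append(Word(
                page_no,
                el.text or "",
                float(el.get("xMin")),
                float(el.get("yMin")),
                float(el.get("xMax")),
                float(el.get("yMax")),
            ))
    return words


def words_with_coordinates(pdf_path):
    """Run the poppler CLI (assumed to be installed) and parse its output."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" sends output to stdout
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Usage would look like `for w in words_with_coordinates("example.pdf"): print(w)`, where `example.pdf` is a placeholder path. Shelling out keeps this separate from the library, which stays limited to full-page text.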

I guess I need to talk to the folks in the poppler project. I would like to know how it manages to mistake the word "OTHER" for "0THER" (a zero for the letter "O") when it clearly appears as the letter "O". This document is not a scan. A PDF should contain font drawing instructions, so I don't see how poppler could confuse the two. I guess I need to learn the internals of the PDF format (as the folks at Poppler did) so I can see where this problem originates.

@cordeliac please only post comments that are relevant to the issue you are commenting on.

Thanks for the confirmation! I appreciate the reliability of this tool for text extraction. I will see what I can do with pdfminer for more advanced usage, although in the past I found it less reliable for accurate text extraction.