Is it possible to find text coordinates on the page using pdftotext?

Question

jeanmonet opened this issue 4 years ago

I suspect the answer is no, but wanted to check in case I'm missing something.
If pdftotext / poppler is not able to provide text coordinates on the page, do you know of another reliable tool to do so?

Jason Alan Palmer · Answer 1 · Sun Sep 27 2020 01:58:18 GMT+0800 (China Standard Time)

I believe poppler can do that with some of its command-line tools, yes. But it's not a part of this Python library. This library is meant to be fast and simple: all it does is extract full pages of text.
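
For anyone landing here later, a minimal sketch of the command-line route mentioned above. It assumes poppler's `pdftotext` binary is on the PATH and that its `-bbox` option writes XHTML containing `<page>` and `<word xMin=… yMin=… xMax=… yMax=…>` elements, which is how recent poppler releases behave as far as I know. The function names `parse_bbox_xhtml` and `words_with_coordinates` are made up for this example and are not part of this library or of poppler.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Parse `pdftotext -bbox` output (str or bytes).

    Returns a list of (page_number, text, x_min, y_min, x_max, y_max) tuples.
    """
    root = ET.fromstring(xhtml)
    words = []
    page_number = 0
    # iter() walks the tree in document order, so each <page> is seen
    # before the <word> elements it contains.
    for elem in root.iter():
        # Drop the XHTML namespace prefix, e.g. "{http://...}word" -> "word".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "page":
            page_number += 1
        elif tag == "word":
            words.append((
                page_number,
                elem.text or "",
                float(elem.get("xMin")),
                float(elem.get("yMin")),
                float(elem.get("xMax")),
                float(elem.get("yMax")),
            ))
    return words


def words_with_coordinates(pdf_path):
    """Run the poppler CLI (assumed installed) and parse what it prints."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" sends output to stdout
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Calling `words_with_coordinates("example.pdf")` (a placeholder path) should then give one tuple per word. The coordinates are in PDF points, and in my understanding they are measured from the top-left corner of the page.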

cordeliac · Answer 2 · Sun Sep 27 2020 03:02:20 GMT+0800 (China Standard Time)

I guess I need to talk to the folks in the poppler project. I would like to know how it manages to mistake the word "OTHER" for "0THER" (a zero for the letter "O") when it clearly appears as the letter "O". This document is not a scan. A PDF should contain font drawing instructions, so I don't see how poppler could confuse the two. I guess I need to learn the internal coding of a PDF (as the folks at Poppler did) so I can see where this problem originates.

Jason Alan Palmer · Answer 3 · Sun Sep 27 2020 03:16:39 GMT+0800 (China Standard Time)

@cordeliac please only post comments that are relevant to the issue you are commenting on.

jeanmonet · Answer 4 · Sun Sep 27 2020 03:27:21 GMT+0800 (China Standard Time)

Thanks for the confirmation! I appreciate the reliability of this tool for text extraction. I will see what I can do with pdfminer for more advanced usage, although in the past I found it less reliable for accurate text extraction.
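
A rough sketch of the pdfminer route, in case it helps someone else. It assumes the `pdfminer.six` package is installed and uses its `extract_pages` and `LTTextContainer` names. pdfminer reports each `bbox` as `(x0, y0, x1, y1)` with the origin at the bottom-left of the page, so the hypothetical helper `to_top_left_origin` flips the y axis to get top-down coordinates. The flip assumes the page box starts at y = 0, which may not hold for every PDF.

```python
def to_top_left_origin(bbox, page_height):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to top-left origin."""
    x0, y0, x1, y1 = bbox
    # The top edge (y1) becomes the smaller y value after flipping.
    return (x0, page_height - y1, x1, page_height - y0)


def iter_text_boxes(pdf_path):
    """Yield (page_number, text, bbox) for each text box pdfminer finds."""
    # Imported lazily so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                yield (
                    page_number,
                    element.get_text().strip(),
                    to_top_left_origin(element.bbox, page.height),
                )
```

These are text boxes (roughly paragraphs or lines) rather than single words, so the granularity differs from a per-word listing.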