jalan / pdftotext

Simple PDF text extraction

Provide access to page::text_list

stefan6419846 opened this issue

The current wrapper implementation only provides access to the page->text method results.

There is a similar text_list method in the original Poppler code (since version 0.63.0?) which provides access to individual words and their bounding boxes. With this, functionality such as selecting a clipping region, re-ordering the text, or filtering out text that is too small can be implemented. This roughly corresponds to the -bbox option of the CLI.

It would be great if the Python wrapper could provide access to the words with their bounding boxes for further post-processing.

This is somewhat tangential, but you can use the python-poppler package to achieve this (though admittedly it was a bit unclear at first how to do it).
The code would be something along the lines of:

from poppler import Rectangle, load_from_file

# load the file
pdf = load_from_file("somefile.pdf")

# argument can be either 0-based index or a "page label" (whatever the latter is)
# note that this doesn't really "create" a page (in the sense of modifying the
# original or a copy of the PDF), it simply returns a `Page` object
page = pdf.create_page(0)

# go over the text list: each item carries a word and its bounding box
for item in page.text_list():
    print(item.bbox.as_tuple(), item.text)

# getting text from some (rectangular) region;
# the values below are placeholders, substitute your own coordinates
x, y, width, height = 0, 0, 200, 100
text_in_region = page.text(Rectangle(x, y, width, height))
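
Once the words and their boxes are available, the post-processing mentioned in the issue (clipping to a region, dropping text that is too small, re-ordering) can be done in plain Python. The following is only a rough sketch and does not depend on python-poppler: `Word`, `filter_words` and `reading_order` are made-up names for illustration, and boxes are assumed to be `(left, top, right, bottom)` with the origin at the top left. Check what `as_tuple()` actually returns in your version before feeding its output in.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed order: left, top, right, bottom


class Word(NamedTuple):
    """Hypothetical container for one word and its bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def inside(word: Word, region: Box) -> bool:
    """Return True if the word's box lies entirely within the region."""
    left, top, right, bottom = region
    return (word.left >= left and word.top >= top
            and word.right <= right and word.bottom <= bottom)


def filter_words(words: List[Word], region: Optional[Box] = None,
                 min_height: float = 0.0) -> List[Word]:
    """Keep words inside the region (if given) that are at least min_height tall."""
    kept = []
    for word in words:
        if region is not None and not inside(word, region):
            continue
        if word.bottom - word.top < min_height:
            continue
        kept.append(word)
    return kept


def reading_order(words: List[Word], line_tolerance: float = 2.0) -> List[List[Word]]:
    """Group words into lines by their top coordinate, then sort each line by left edge.

    A word joins the current line if its top is within line_tolerance of the
    top of that line's first word; otherwise it starts a new line.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.left) for line in lines]


# Made-up sample data; with python-poppler one could try building it as
# [Word(item.text, *item.bbox.as_tuple()) for item in page.text_list()],
# provided the tuple order matches the assumption above.
sample = [
    Word("world", 60, 10.5, 100, 22),
    Word("Hello", 10, 10, 50, 22),
    Word("footnote", 10, 700, 40, 704),
    Word("Second", 10, 30, 55, 42),
]
kept = filter_words(sample, region=(0, 0, 200, 100), min_height=6.0)
lines = reading_order(kept)
print("\n".join(" ".join(w.text for w in line) for line in lines))
```

The line grouping here is deliberately naive (a fixed tolerance on the top coordinate), so rotated text or multi-column layouts would need something more careful.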