Passing `extra_attrs=["matrix"]` to `.extract_words()` seems to return "chars" instead of "words"
cmdlineluser opened this issue
Describe the bug
When adding `extra_attrs=["matrix"]` to `.extract_words()`, it appears to change behaviour and return only chars instead.
Have you tried repairing the PDF?
Not applicable.
Code to reproduce the problem
import pdfplumber
page = pdfplumber.open("ca-warn-report.pdf").pages[0]
print(page.extract_words()[0]["text"])
# 'WARN'
print(page.extract_words(extra_attrs=["matrix"])[0]["text"])
# 'N'
PDF file
https://github.com/jsvine/pdfplumber/raw/stable/examples/pdfs/ca-warn-report.pdf
Expected behavior
The same words returned along with the additional matrix attribute.
Actual behavior
Each individual char was returned.
Screenshots
Not applicable.
Environment
- pdfplumber version: 0.11.2
- Python version: 3.10.9
- OS: Mac
Additional context
I was experimenting with trying to parse the PDF in #1170.
The goal was to find the intersection of words and a specific subset of `page.chars`.
I'm not sure if it would make sense for `extract_words()` to also have a `return_chars=True` option?
As a workaround I was trying to return `matrix` and noticed this behaviour.
The role of `extra_attrs` in `.extract_words(...)` is to monitor additional attributes (beyond the default) for changes, and to start a new "word" when those attributes change. Because `matrix` is going to be different for virtually every character, it's going to create a new "word" on each character.
Similarly, because `matrix` will be different for each character in each word, `.extract_words(...)` wouldn't know what to return for the `matrix` value of a multi-character word.
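To make the splitting behaviour described above concrete, here is a deliberately simplified sketch. It is not pdfplumber's actual implementation: `group_into_words` is a hypothetical helper that only breaks on whitespace and on changes in the listed attributes, and the char dicts are hand-made stand-ins with a distinct `matrix` per character.

```python
def group_into_words(chars, extra_attrs=()):
    """Toy grouping: start a new word on whitespace or when any extra attr changes."""
    words, current = [], []
    for char in chars:
        if char["text"].isspace():
            if current:
                words.append(current)
            current = []
            continue
        # Compare the monitored attributes against the previous char in the word
        changed = current and any(
            char[attr] != current[-1][attr] for attr in extra_attrs
        )
        if changed:
            words.append(current)
            current = []
        current.append(char)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]


# Fake chars: same font throughout, but a distinct matrix per character
# (the last two matrix entries stand in for the x/y translation).
chars = [
    {"text": t, "fontname": "Helvetica", "matrix": (1, 0, 0, 1, 10 * i, 0)}
    for i, t in enumerate("WARN Report")
]

default_words = group_into_words(chars)
font_words = group_into_words(chars, extra_attrs=["fontname"])
matrix_words = group_into_words(chars, extra_attrs=["matrix"])
print(default_words)  # ['WARN', 'Report']
print(font_words)     # ['WARN', 'Report']
print(matrix_words)   # one entry per character
```

An attribute that is constant across a word (such as the font here) leaves the words intact, while one that differs on every character splits every word apart.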
Was there a particular reason that `matrix` was of interest to you in your goal of finding the intersection of words? Alternatively, perhaps you can explain a bit more about the broader goal?
> `matrix` is going to be different for virtually every character
Ah... that makes sense - thanks.
> perhaps you can explain a bit more about the broader goal
The goal was to try to find which words contained a specific subset of `page.chars`,
e.g. find a specific word, use `cluster_objects` to find "aligned chars", then try to find which words contain those chars.
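from operator import itemgetter
from pdfplumber.utils import cluster_objects, obj_to_bbox, within_bbox

page_words = page.extract_words(keep_blank_chars=True, use_text_flow=True)
word = page.search(r"foo")[0]
# cluster_objects needs a tolerance and returns a list of clusters;
# tolerance=1 is an assumed value. Keep the cluster that holds the match.
clusters = cluster_objects(page.chars + [word], itemgetter("x0"), tolerance=1)
cluster = next(c for c in clusters if any(obj is word for obj in c))
chars = [obj for obj in cluster if obj is not word]
# Keep the words whose bounding box fully contains at least one aligned char
words = [
    w for w in page_words
    if within_bbox(chars, obj_to_bbox(w))
]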
Which did work, but took a long time (likely due to calling `within_bbox` so many times).
I thought `.extract_words(..., return_chars=True)` may work, so I was just experimenting with `extra_attrs` to see what extra information I could return (but my thinking about `matrix` was flawed, as it would not have helped here).
I saw what `extract_words` did and just used `iter_extract_tuples` directly:
from pdfplumber.utils.text import WordExtractor

# Yields (word, word_chars) tuples instead of just the word dicts
page_words = (
    WordExtractor(keep_blank_chars=True, use_text_flow=True)
    .iter_extract_tuples(page.chars)
)
# Map each char's matrix to the word (and sibling chars) it belongs to
page_chars = {}
for word, word_chars in page_words:
    for char in word_chars:
        page_chars[char["matrix"]] = dict(word=word, chars=word_chars)
Searching `page_chars` keyed by `matrix` ended up being a much faster approach to find the intersection of chars / words.
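For readers who want to see the lookup step itself, here is a minimal, self-contained sketch with hand-made data. The `make_chars` helper and the fake word tuples are illustrative stand-ins for what `iter_extract_tuples` yields, and the sketch assumes that `matrix` values are unique per character on the page; two chars sharing a matrix would overwrite each other in the index.

```python
def make_chars(text, x_start):
    """Hypothetical helper: fake char dicts with a distinct matrix per character."""
    return [
        {"text": t, "matrix": (1, 0, 0, 1, x_start + 10 * i, 700)}
        for i, t in enumerate(text)
    ]


# Stand-ins for the (word, word_chars) tuples; real dicts carry many more keys.
word_tuples = [
    ({"text": "WARN"}, make_chars("WARN", 0)),
    ({"text": "Report"}, make_chars("Report", 100)),
]

# Build the index once: matrix -> owning word.
word_by_matrix = {
    char["matrix"]: word
    for word, word_chars in word_tuples
    for char in word_chars
}

# Chars of interest (e.g. from a clustering step): one dict lookup each,
# instead of a bounding-box test against every word on the page.
selected_chars = [word_tuples[0][1][1], word_tuples[1][1][0]]
found = [word_by_matrix[char["matrix"]]["text"] for char in selected_chars]
print(found)  # ['WARN', 'Report']
```

The speed-up comes from replacing repeated geometric comparisons with a single pass to build the dictionary, followed by constant-time lookups.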
Thanks for sharing, and clever solution! Makes me think it'd be helpful to add a `return_chars=True` parameter to `.extract_words(...)` (analogous to `.search(return_chars=True)`).
Now added in 1496cbd 👍
Awesome stuff - thank you so much!
Oh nice! `return_chars=True` is going to be very helpful to me, as I want to postprocess extracted words with a tokenizer for input to a language model while retaining the layout information. Current models like LayoutLM that do this, do it wrong ;-)
@dhdaines Happy to hear it, and thanks for sharing that use-case!
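As a rough illustration of the tokenizer use case mentioned above, the sketch below splits a word into sub-tokens and gives each one a bounding box computed from its own characters. It assumes each word dict carries a `chars` list (the shape expected with `return_chars=True`), that there is one char dict per character of the word's text, and that the pieces concatenate back to that text. `split_word` and the sample data are hypothetical.

```python
def split_word(word, pieces):
    """Split a word dict into sub-tokens, giving each the bbox of its own chars."""
    tokens, start = [], 0
    for piece in pieces:
        piece_chars = word["chars"][start:start + len(piece)]
        tokens.append({
            "text": piece,
            "x0": min(c["x0"] for c in piece_chars),
            "x1": max(c["x1"] for c in piece_chars),
            "top": min(c["top"] for c in piece_chars),
            "bottom": max(c["bottom"] for c in piece_chars),
        })
        start += len(piece)
    return tokens


# Hand-made word in the shape assumed for return_chars=True output.
word = {
    "text": "Report",
    "chars": [
        {"text": t, "x0": 10 * i, "x1": 10 * i + 8, "top": 100, "bottom": 112}
        for i, t in enumerate("Report")
    ],
}

# "Re" + "port" stands in for whatever a real subword tokenizer would emit.
tokens = split_word(word, ["Re", "port"])
print([(t["text"], t["x0"], t["x1"]) for t in tokens])
# [('Re', 0, 18), ('port', 20, 58)]
```

Ligatures or other cases where one char dict covers several characters would break the one-to-one slicing and need extra handling.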