Passing `extra_attrs=["matrix"]` to `.extract_words()` seems to return "chars" instead of "words"

cmdlineluser opened this issue 6 months ago

Describe the bug

Adding extra_attrs=["matrix"] to .extract_words() appears to change its behaviour: it returns individual chars instead of words.

Have you tried repairing the PDF?

Not applicable.

Code to reproduce the problem

import pdfplumber

page = pdfplumber.open("ca-warn-report.pdf").pages[0]

print(page.extract_words()[0]["text"])
# 'WARN'

print(page.extract_words(extra_attrs=["matrix"])[0]["text"])
# 'N'

PDF file

https://github.com/jsvine/pdfplumber/raw/stable/examples/pdfs/ca-warn-report.pdf

Expected behavior

The same words returned along with the additional matrix attribute.

Actual behavior

Each individual char was returned.

Screenshots

Not applicable.

Environment

pdfplumber version: 0.11.2
Python version: 3.10.9
OS: Mac

Additional context

I was experimenting with trying to parse the PDF in #1170

The goal was to find the intersection of words and a specific subset of page.chars

I'm not sure if it would make sense for extract_words() to also have a return_chars=True option?

As a workaround I was trying to return matrix and noticed this behaviour.

Jeremy Singer-Vine · Answer 1 · Tue Jul 16 2024 06:43:26 GMT+0800 (China Standard Time)

The role of extra_attrs in .extract_words(...) is to monitor additional attributes (beyond the default) for changes — and to start a new "word" when those attributes change. Because matrix is going to be different for virtually every character, it's going to create a new "word" on each character.

Similarly, because each matrix will be different for each character in each word, .extract_words(...) wouldn't know what to return for the matrix value of a multi-character word.
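The splitting rule described above can be illustrated with a simplified model. This is only a sketch, not pdfplumber's actual implementation (which also considers spacing and position); `split_on_attr_change` and the char dicts are hypothetical, and the matrix values are made up for illustration.

```python
from itertools import groupby


def split_on_attr_change(chars, attrs):
    """Group consecutive chars, starting a new group whenever any attr changes."""
    def key(char):
        return tuple(char[a] for a in attrs)

    return ["".join(c["text"] for c in group) for _, group in groupby(chars, key=key)]


# Hypothetical char dicts: one font throughout, but a distinct matrix per char
chars = [
    {"text": t, "fontname": "Helvetica", "matrix": (1, 0, 0, 1, 10 + 7 * i, 50)}
    for i, t in enumerate("WARN")
]

by_font = split_on_attr_change(chars, ["fontname"])
by_matrix = split_on_attr_change(chars, ["matrix"])
print(by_font)    # ['WARN']
print(by_matrix)  # ['W', 'A', 'R', 'N']
```

An attribute that stays constant across a word (such as a font name) keeps the word intact, while an attribute that differs per character breaks it into single-character "words".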

Was there a particular reason that matrix was of interest to you in your goal of finding the intersection of words? Alternatively, perhaps you can explain a bit more about the broader goal?

Karl Genockey · Answer 2 · Tue Jul 16 2024 15:41:48 GMT+0800 (China Standard Time)

> matrix is going to be different for virtually every character

Ah... that makes sense - thanks.

> perhaps you can explain a bit more about the broader goal

The goal was to find which words contain a specific subset of page.chars.

For example: find a specific word, use cluster_objects to find the chars aligned with it, then find which words contain those chars.

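from operator import itemgetter

from pdfplumber.utils import cluster_objects, obj_to_bbox, within_bbox

page_words = page.extract_words(keep_blank_chars=True, use_text_flow=True)

# First match for the target word (a dict with x0/top/x1/bottom)
word = page.search(r"foo")[0]

# cluster_objects needs a tolerance (1 is an assumed value here) and returns
# a list of clusters, so pick the cluster that contains the matched word
clusters = cluster_objects(page.chars + [word], itemgetter("x0"), tolerance=1)
cluster = next(c for c in clusters if any(obj is word for obj in c))
chars = [obj for obj in cluster if obj is not word]

# Keep the words whose bbox contains at least one of the aligned chars
words = [
    w for w in page_words
    if within_bbox(chars, obj_to_bbox(w))
]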

This worked, but it was slow, likely because within_bbox is called once for every word.

I thought .extract_words(..., return_chars=True) might work, so I was experimenting with extra_attrs to see what extra information I could return. (My thinking about matrix was flawed, though, as it would not have helped here.)

I looked at what extract_words does and used WordExtractor.iter_extract_tuples directly:

from pdfplumber.utils.text import WordExtractor

# Yields (word, chars) tuples: each word dict with the chars that form it
page_words = (
    WordExtractor(keep_blank_chars=True, use_text_flow=True)
    .iter_extract_tuples(page.chars)
)

# Map each char's matrix to its word and that word's chars
page_chars = {}
for word, word_chars in page_words:
    for char in word_chars:
        page_chars[char["matrix"]] = dict(word=word, chars=word_chars)

Searching page_chars keyed by matrix ended up being a much faster approach to find the intersection of chars / words.

Jeremy Singer-Vine · Answer 3 · Sun Aug 04 2024 05:56:39 GMT+0800 (China Standard Time)

Thanks for sharing, and clever solution! Makes me think it'd be helpful to add a return_chars=True parameter to .extract_words(...) (analogous to .search(return_chars=True)).

Jeremy Singer-Vine · Answer 4 · Sun Aug 04 2024 23:15:38 GMT+0800 (China Standard Time)

Now added in 1496cbd 👍
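With that parameter, the char-to-word lookup from earlier in the thread could be built from the word dicts directly. The following is a minimal sketch, assuming each word returned by `page.extract_words(return_chars=True)` carries its characters under a "chars" key (as `.search(return_chars=True)` matches do) and that `matrix` is unique per char on the page. `index_words_by_matrix` is a hypothetical helper, and the data is a synthetic stand-in so the example runs without a PDF.

```python
def index_words_by_matrix(words):
    """Map each char's matrix to the word dict that contains that char."""
    index = {}
    for word in words:
        for char in word["chars"]:
            index[char["matrix"]] = word
    return index


# Synthetic stand-in for page.extract_words(return_chars=True);
# real word and char dicts carry many more keys.
words = [
    {"text": "foo", "chars": [
        {"text": "f", "matrix": (1, 0, 0, 1, 10, 50)},
        {"text": "o", "matrix": (1, 0, 0, 1, 16, 50)},
        {"text": "o", "matrix": (1, 0, 0, 1, 22, 50)},
    ]},
    {"text": "bar", "chars": [
        {"text": "b", "matrix": (1, 0, 0, 1, 40, 50)},
        {"text": "a", "matrix": (1, 0, 0, 1, 46, 50)},
        {"text": "r", "matrix": (1, 0, 0, 1, 52, 50)},
    ]},
]

index = index_words_by_matrix(words)

# Look up the word that owns a given char, e.g. one found via cluster_objects
aligned_char = words[1]["chars"][0]
found = index[aligned_char["matrix"]]["text"]
print(found)  # bar
```

Each lookup is a single dict access, which avoids calling within_bbox once per word.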

Karl Genockey · Answer 5 · Tue Aug 06 2024 15:20:24 GMT+0800 (China Standard Time)

Awesome stuff - thank you so much!

David Huggins-Daines · Answer 6 · Sat Aug 17 2024 20:09:02 GMT+0800 (China Standard Time)

Oh nice! return_chars=True is going to be very helpful to me, as I want to postprocess extracted words with a tokenizer for input to a language model while retaining the layout information. Current models like LayoutLM that do this, do it wrong ;-)
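One way such postprocessing might look is sketched below. It assumes a tokenizer has already split a word into sub-tokens whose concatenation equals the word's text, and that there is exactly one char dict per text character (ligatures would break this). `token_bboxes` is a hypothetical helper and the coordinates are synthetic; this is not how LayoutLM or any particular model handles layout.

```python
def token_bboxes(word_chars, tokens):
    """Return (token, (x0, top, x1, bottom)) pairs by merging the char boxes each token spans."""
    out = []
    pos = 0
    for token in tokens:
        span = word_chars[pos:pos + len(token)]
        bbox = (
            min(c["x0"] for c in span),
            min(c["top"] for c in span),
            max(c["x1"] for c in span),
            max(c["bottom"] for c in span),
        )
        out.append((token, bbox))
        pos += len(token)
    return out


# Synthetic chars for the word "tokenizer": each char is 5 units wide
word_chars = [
    {"text": t, "x0": 100 + 5 * i, "x1": 105 + 5 * i, "top": 20, "bottom": 30}
    for i, t in enumerate("tokenizer")
]

boxes = token_bboxes(word_chars, ["token", "izer"])
print(boxes)
# [('token', (100, 20, 125, 30)), ('izer', (125, 20, 145, 30))]
```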

Jeremy Singer-Vine · Answer 7 · Mon Aug 19 2024 07:25:08 GMT+0800 (China Standard Time)

@dhdaines Happy to hear it, and thanks for sharing that use-case!