jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Repeating characters

samkit-jain opened this issue

I'm facing a weird problem wherein characters are repeated when using extract_text() or extract_tables(). For example, SSttaatteemmeenntt ooff AAccccoouunnttss is printed instead of Statement of Accounts.

Sometimes it happens in a portion of the PDF and sometimes in the whole PDF. When it happens in a portion of the PDF, it is fixable (not completely) via extract_text(x_tolerance=0, y_tolerance=0), but not when the issue affects the whole PDF. Also, note that I do not face this issue in all PDFs, only in some.

Lines are also repeated. For example:

Year-to-date totals do not reflect any fee or interest refunds
Year-to-date totals do not reflect any fee or interest refunds
you may have received.
you may have received.

On running first_page.extract_words(x_tolerance=0, y_tolerance=0), there are two instances of a single word:

{'x0': Decimal('231.532'), 'x1': Decimal('252.251'), 'top': Decimal('916.343'), 'bottom': Decimal('925.422'), 'text': 'reflect'}
{'x0': Decimal('231.533'), 'x1': Decimal('252.252'), 'top': Decimal('916.383'), 'bottom': Decimal('925.462'), 'text': 'reflect'}

And repeating characters are still present for some words:

{'x0': Decimal('489.040'), 'x1': Decimal('506.160'), 'top': Decimal('269.320'), 'bottom': Decimal('277.480'), 'text': 'ttooddaayy'}

That's strange, indeed. My hunch is that there really are two copies of each letter in the PDF. (One set of letters might be transparent, perhaps?) What happens if you try extracting the text with another tool, such as poppler-utils's pdftotext? (https://en.wikipedia.org/wiki/Pdftotext)

No such problem with pdftotext. This is the output:

No repeating lines

Year-to-date totals do not reflect any fee or interest refunds
you may have received.

No repeating characters

today
Statement of Accounts

I've encountered this problem as well. In my case it was cropping up in fillable PDFs, and I theorized that the folks filling out the PDF were somehow resaving it on top of the original text. I found it was easier to just remove duplicate characters via script than to make sense of the PDF. I dunno for sure, but I suspect that other PDF output tools are removing duplicate characters.

I'm not really sure what the right solution is, but possibly adding a 'remove duplicate characters' option would make this more manageable? My case involved exact matches--characters occurring at exactly the same spot--so a fix was easy... I suppose if they were slightly offset it would be more challenging.
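For the exact-match case described above, a minimal sketch of such a script could look like the following. It assumes each character is a dict with "text", "x0" and "top" keys, similar to the dicts pdfplumber exposes; the function name drop_exact_duplicates and the sample data are made up for illustration and are not part of pdfplumber.

```python
def drop_exact_duplicates(chars):
    """Keep only the first character seen at each (text, x0, top) position."""
    seen = set()
    unique = []
    for ch in chars:
        key = (ch["text"], ch["x0"], ch["top"])
        if key not in seen:
            seen.add(key)
            unique.append(ch)
    return unique


# Hypothetical sample: "St" drawn twice at exactly the same coordinates.
chars = [
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
]
deduped_text = "".join(ch["text"] for ch in drop_exact_duplicates(chars))
print(deduped_text)  # St
```

Because the key includes the position, genuinely doubled letters (the "tt" in "letter", say) survive, since they sit at different x0 values. Copies that are slightly offset are not caught by this approach.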

Getting the same issue. Please share a resolution if there is one.

AFAIK, duplicated characters are also used to render bold text, so there will be cases with a small offset.
Deduplication is possible by checking the overlap ratio of all characters using their coordinates.
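One way to read the overlap-ratio idea is sketched below. It is only an illustration: the ratio definition (intersection area over the smaller box's area), the 0.8 threshold and the function names are assumptions, not anything pdfplumber defines. The first two boxes are the 'reflect' words reported earlier in this thread, with the Decimal values written as floats; the third box is invented to show that the same word on another line is kept.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller of the two boxes."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if width <= 0 or height <= 0:
        return 0.0
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    area_b = (b["x1"] - b["x0"]) * (b["bottom"] - b["top"])
    smaller = min(area_a, area_b)
    return (width * height) / smaller if smaller > 0 else 0.0


def dedupe_by_overlap(boxes, min_ratio=0.8):
    """Drop a box when an already-kept box has the same text and overlaps heavily."""
    kept = []
    for box in boxes:
        is_duplicate = any(
            k["text"] == box["text"] and overlap_ratio(k, box) >= min_ratio
            for k in kept
        )
        if not is_duplicate:
            kept.append(box)
    return kept


words = [
    {"x0": 231.532, "x1": 252.251, "top": 916.343, "bottom": 925.422, "text": "reflect"},
    {"x0": 231.533, "x1": 252.252, "top": 916.383, "bottom": 925.462, "text": "reflect"},
    # Invented: the same word further down the page, which must not be dropped.
    {"x0": 231.532, "x1": 252.251, "top": 940.000, "bottom": 949.079, "text": "reflect"},
]
kept = dedupe_by_overlap(words)
print(len(kept))  # 2
```

The pairwise comparison is quadratic in the number of boxes, which is fine for a page of text but could be sped up by sorting on position first.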

Any solution to it? I have the same issue.

I recently stumbled across this issue - just tossing it out there to let folks know it's a continuing thing.

@hannylicious and other watchers of this issue, if you have a PDF with this issue that you can share publicly, please do, so that this issue can be investigated in further detail.

I am pretty sure I have a PDF with this issue but it will take me some time to find it.

Unfortunately - I dabble with PDFs very infrequently. I just happened across it this time because another library (PyPDF2) didn't see any text at all - whereas pdfplumber saw the text, but it was duplicated. The PDF I'm working with at this time has some information that I can't publicly display so I won't be of much assistance I'm afraid.

I resolved my use case simply by grabbing the first of the results and using that.

Pdfplumber is a great tool - I will most likely be using this from now on! If I run across this issue on a PDF that I can link up - I definitely will!

Thanks, @hannylicious! If you have the time, you could try using https://github.com/JoshData/pdf-redactor to remove the sensitive information without altering the PDF structure. If the result still produces the same character-duplication, then it could be very helpful for resolving this issue.

Thanks @jsvine - I will definitely have a look at that pdf-redactor library. If it works - I'll be sure and post that PDF here!

I would like to help, but my file has confidential content. Does anyone have a file that shows the issue?

Same issue here.

@pajaskowiak Can you share a PDF that demonstrates the issue?

Getting the same issue. The PDF is repeat.pdf.

The duplicate text indeed is drawn twice in the PDF, the second time with a small horizontal offset to create the appearance of a bold font.
Actually, though, this PDF gives a hint that the second copy shall be ignored by marking it with an empty ActualText property. By evaluating that property, therefore, pdfplumber could correctly extract this PDF.

Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.
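As a rough sketch of that proposal (not the code that ended up in pdfplumber), the rule "same text, fontname, and size, within a few points" could be written as below. The dict keys mirror pdfplumber's char attributes; the function names, the font name, the coordinates and the 1-point tolerance are assumptions made for illustration.

```python
def dedupe_near(chars, tolerance=1.0):
    """Drop chars matching a kept char's text, fontname and size within `tolerance` points."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            k["text"] == ch["text"]
            and k["fontname"] == ch["fontname"]
            and k["size"] == ch["size"]
            and abs(k["x0"] - ch["x0"]) <= tolerance
            and abs(k["top"] - ch["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


def draw_twice(word, offset=0.3):
    """Hypothetical helper: emit each letter twice, the copy shifted right (fake bold)."""
    chars = []
    for i, letter in enumerate(word):
        x0 = 489.0 + i * 3.4  # invented layout: letters 3.4 points apart
        for dx in (0.0, offset):
            chars.append({"text": letter, "fontname": "Helvetica", "size": 8.0,
                          "x0": x0 + dx, "top": 269.32})
    return chars


raw = draw_twice("today")
raw_text = "".join(ch["text"] for ch in raw)
clean_text = "".join(ch["text"] for ch in dedupe_near(raw))
print(raw_text)    # ttooddaayy
print(clean_text)  # today
```

The tolerance has to stay below the normal letter spacing, otherwise real double letters in the same font would be merged as well.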

Indeed, there are many PDFs out there drawing text twice for some visual effect (bold, shadow, ...) but by far not all of them use ActualText to mark one copy as ignorable like @xv44586's example file does. Thus, finding duplicates explicitly will help more often in this regard than checking the ActualText.

> @pajaskowiak Can you share a PDF that demonstrates the issue?

I'm really sorry but I can't. It contains sensitive information.

> Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates, perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.

I did something similar to this. Anyways, I could fix the duplicates in my own code. Having the text from the pdf, even with eventual duplicates is a big help already! Thank you for the project!

Commit 04fd56a (available in develop and in the next release) provides a Page.dedupe_chars(...) method that should address this general type of character duplication. (Thanks to @xv44586 for the PDF and test!) I'm closing this issue for now, but if anyone encounters character-duplication issues that the new method does not solve, feel free to comment on this thread. Priority will be given to comments containing a specific PDF and code that demonstrate the problem.
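For anyone landing here, a minimal usage sketch follows, assuming a pdfplumber release that includes Page.dedupe_chars. The two wrapper functions are hypothetical names for this example, and the tolerance value of 1 is an assumption to tune per document.

```python
def deduped_page_text(page, tolerance=1):
    """Return the page's text after removing near-duplicate characters."""
    # dedupe_chars returns a filtered copy of the page; the original page is unchanged.
    return page.dedupe_chars(tolerance=tolerance).extract_text()


def extract_deduped_text(pdf_path, page_number=0, tolerance=1):
    """Open a PDF and return deduplicated text for one page (0-based index)."""
    import pdfplumber  # third-party; imported here so the sketch loads without it

    with pdfplumber.open(pdf_path) as pdf:
        return deduped_page_text(pdf.pages[page_number], tolerance=tolerance)


# Example call (requires pdfplumber and a local copy of an affected file,
# such as the repeat.pdf shared in this thread):
# print(extract_deduped_text("repeat.pdf"))
```

If duplicates remain, raising the tolerance slightly is worth trying before falling back to a custom filter over page.chars.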