Repeating characters

Question

samkit-jain opened this issue 6 years ago

I'm facing a weird problem where characters are repeated when using extract_text() or extract_tables(). For example, "SSttaatteemmeenntt ooff AAccccoouunnttss" is printed instead of "Statement of Accounts".

Sometimes it happens in a portion of the PDF and sometimes in the whole PDF. When it affects only a portion, it can be partly (not completely) fixed via extract_text(x_tolerance=0, y_tolerance=0), but not when it affects the whole PDF. Also note that I see this issue only in some PDFs, not all.

Lines are also repeated. For example:

Year-to-date totals do not reflect any fee or interest refunds
Year-to-date totals do not reflect any fee or interest refunds
you may have received.
you may have received.

Samkit Jain · Answer 1 · Tue Jul 31 2018 14:50:11 GMT+0800 (China Standard Time)

After running first_page.extract_words(x_tolerance=0, y_tolerance=0), there are two instances of a single word, at almost identical positions:

{'x0': Decimal('231.532'), 'x1': Decimal('252.251'), 'top': Decimal('916.343'), 'bottom': Decimal('925.422'), 'text': 'reflect'}
{'x0': Decimal('231.533'), 'x1': Decimal('252.252'), 'top': Decimal('916.383'), 'bottom': Decimal('925.462'), 'text': 'reflect'}

And repeating characters are still present within some words:

{'x0': Decimal('489.040'), 'x1': Decimal('506.160'), 'top': Decimal('269.320'), 'bottom': Decimal('277.480'), 'text': 'ttooddaayy'}

Jeremy Singer-Vine · Answer 2 · Wed Aug 01 2018 08:37:00 GMT+0800 (China Standard Time)

That's strange, indeed. My hunch is that there really are two copies of each letter in the PDF. (One set of letters might be transparent, perhaps?) What happens if you try extracting the text with another tool, such as poppler-utils's pdftotext? (https://en.wikipedia.org/wiki/Pdftotext)

Samkit Jain · Answer 3 · Wed Aug 01 2018 13:03:05 GMT+0800 (China Standard Time)

No such problem with pdftotext. This is the output,

No repeating lines

Year-to-date totals do not reflect any fee or interest refunds
you may have received.

No repeating characters

today

Statement of Accounts

Jacob Fenton · Answer 4 · Thu Aug 02 2018 00:46:53 GMT+0800 (China Standard Time)

I've encountered this problem as well. In my case it was cropping up in fillable PDFs, and I theorized that the folks filling out the PDF were somehow resaving it on top of the original text. I found it was easier to just remove duplicate characters via script than to make sense of the PDF. I don't know for sure, but I suspect that other PDF output tools are removing duplicate characters.

I'm not really sure what the right solution is, but possibly adding a 'remove duplicate characters' option would make this more manageable? My case involved exact matches--characters occurring at exactly the same spot--so a fix was easy... I suppose if they were slightly offset it would be more challenging.
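
A minimal sketch of the exact-match cleanup described above, assuming each character is a dict with "text", "x0", and "top" keys (the shape pdfplumber uses for entries in page.chars). The function name and the sample data are hypothetical, not the script used in this comment:

```python
def drop_exact_duplicates(chars):
    """Keep only the first character seen at each exact (text, x0, top) position."""
    seen = set()
    unique = []
    for ch in chars:
        key = (ch["text"], ch["x0"], ch["top"])
        if key not in seen:
            seen.add(key)
            unique.append(ch)
    return unique


# Made-up sample data: every character is drawn twice at the same spot.
sample_chars = [
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
]

print("".join(ch["text"] for ch in drop_exact_duplicates(sample_chars)))
```

This only catches copies at identical coordinates; copies that are slightly offset would need a tolerance-based comparison instead.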

NaveenBandi · Answer 5 · Thu Apr 11 2019 13:53:09 GMT+0800 (China Standard Time)

I'm getting the same issue. Please share a resolution.

Kyeonghoon Koo (Bryan Koo) · Answer 6 · Sat Aug 22 2020 02:38:38 GMT+0800 (China Standard Time)

AFAIK, duplicated characters are also used to render bold text, and in some cases the copies have a small offset.
Deduplication is possible by checking the overlap ratio of all characters using their coordinates.
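
One way the overlap-ratio idea could look, as a sketch only. It assumes characters are dicts with "text", "x0", "x1", "top", and "bottom" keys; the function names, the 0.8 threshold, and the sample coordinates are all assumptions chosen for illustration:

```python
def overlap_ratio(a, b):
    """Intersection area of two character boxes divided by the smaller box's area."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    area_b = (b["x1"] - b["x0"]) * (b["bottom"] - b["top"])
    smaller = min(area_a, area_b)
    return (width * height) / smaller if smaller > 0 else 0.0


def dedupe_by_overlap(chars, threshold=0.8):
    """Drop a character if an already-kept character has the same text and overlaps it heavily."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            k["text"] == ch["text"] and overlap_ratio(k, ch) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Made-up sample: each letter is drawn twice, the copy shifted 0.3 pt to the right.
bold_chars = [
    {"text": "t", "x0": 100.0, "x1": 104.0, "top": 50.0, "bottom": 60.0},
    {"text": "t", "x0": 100.3, "x1": 104.3, "top": 50.0, "bottom": 60.0},
    {"text": "o", "x0": 104.0, "x1": 109.0, "top": 50.0, "bottom": 60.0},
    {"text": "o", "x0": 104.3, "x1": 109.3, "top": 50.0, "bottom": 60.0},
]

print("".join(ch["text"] for ch in dedupe_by_overlap(bold_chars)))
```

Legitimate double letters (the "oo" in "too") sit side by side and barely overlap, so they survive the check. The pairwise comparison is quadratic in the number of characters, which is fine for a sketch but may be slow on dense pages.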

Tiago Samaha · Answer 7 · Mon Aug 24 2020 19:01:03 GMT+0800 (China Standard Time)

Is there any solution to this? I have the same issue.

Hanny · Answer 8 · Tue Sep 01 2020 01:36:01 GMT+0800 (China Standard Time)

I recently stumbled across this issue - just tossing it out there to let folks know it's a continuing thing.

Samkit Jain · Answer 9 · Tue Sep 01 2020 02:23:51 GMT+0800 (China Standard Time)

@hannylicious and other watchers of this issue, if you have a PDF with this issue that you can share publicly, please do so that this issue can be investigated in further detail.

I am pretty sure I have a PDF with this issue but it will take me some time to find it.

Hanny · Answer 10 · Tue Sep 01 2020 03:17:38 GMT+0800 (China Standard Time)

Unfortunately - I dabble with PDFs very infrequently. I just happened across it this time because another library (PyPDF2) didn't see any text at all - whereas pdfplumber saw the text, but it was duplicated. The PDF I'm working with at this time has some information that I can't publicly display so I won't be of much assistance I'm afraid.

I resolved my use case simply by grabbing the first of the results and using that.

Pdfplumber is a great tool - I will most likely be using this from now on! If I run across this issue on a PDF that I can link up - I definitely will!

Jeremy Singer-Vine · Answer 11 · Tue Sep 01 2020 10:04:37 GMT+0800 (China Standard Time)

Thanks, @hannylicious! If you have the time, you could try using https://github.com/JoshData/pdf-redactor to remove the sensitive information without altering the PDF structure. If the result still produces the same character-duplication, then it could be very helpful for resolving this issue.

Hanny · Answer 12 · Wed Sep 02 2020 04:23:35 GMT+0800 (China Standard Time)

Thanks @jsvine - I will definitely have a look at that pdf-redactor library. If it works - I'll be sure and post that PDF here!

Tiago Samaha · Answer 13 · Wed Sep 02 2020 04:31:15 GMT+0800 (China Standard Time)

I would like to help, but my file has confidential content. Does anyone have a file with this issue that they can share?

Pablo Andretta Jaskowiak · Answer 14 · Sat Sep 12 2020 04:08:24 GMT+0800 (China Standard Time)

Same issue here.

Jeremy Singer-Vine · Answer 15 · Sun Sep 27 2020 01:05:02 GMT+0800 (China Standard Time)

@pajaskowiak Can you share a PDF that demonstrates the issue?

xv44586 · Answer 16 · Mon Sep 28 2020 11:19:43 GMT+0800 (China Standard Time)

Getting the same issue; the PDF is repeat.pdf

mkl · Answer 17 · Mon Sep 28 2020 15:43:20 GMT+0800 (China Standard Time)

The duplicate text indeed is drawn twice in the PDF, the second time with a small horizontal offset to create the appearance of a bold font.
Actually, though, this PDF gives a hint that the second copy shall be ignored by marking it with an empty ActualText property. By evaluating that property, therefore, pdfplumber could correctly extract this PDF.

Jeremy Singer-Vine · Answer 18 · Tue Sep 29 2020 10:06:55 GMT+0800 (China Standard Time)

Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.
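
A rough sketch of what such an option might do, under the assumptions stated in the comment above: a character counts as a duplicate if an earlier one has the same text, fontname, and size and sits within a small x/y tolerance. The function name, the default tolerance of 1 point, and the font name and size in the sample are hypothetical; only the x0/top values are adapted from the extract_words output earlier in this thread. This is not pdfplumber's actual implementation:

```python
def drop_near_duplicates(chars, tolerance=1.0):
    """Remove characters that repeat an earlier one within `tolerance` points."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            k["text"] == ch["text"]
            and k["fontname"] == ch["fontname"]
            and k["size"] == ch["size"]
            and abs(k["x0"] - ch["x0"]) <= tolerance
            and abs(k["top"] - ch["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Offsets of about 0.001 pt (x) and 0.04 pt (y), similar to the 'reflect' example above.
near_chars = [
    {"text": "r", "fontname": "Helvetica", "size": 9.0, "x0": 231.532, "top": 916.343},
    {"text": "r", "fontname": "Helvetica", "size": 9.0, "x0": 231.533, "top": 916.383},
    {"text": "e", "fontname": "Helvetica", "size": 9.0, "x0": 235.000, "top": 916.343},
    {"text": "e", "fontname": "Helvetica", "size": 9.0, "x0": 235.001, "top": 916.383},
]

print("".join(ch["text"] for ch in drop_near_duplicates(near_chars)))
```

Requiring the same fontname and size keeps the check conservative: two different glyphs that happen to be stacked at the same position are left alone.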

mkl · Answer 19 · Tue Sep 29 2020 15:00:33 GMT+0800 (China Standard Time)

Indeed, there are many PDFs out there drawing text twice for some visual effect (bold, shadow, ...) but by far not all of them use ActualText to mark one copy as ignorable like @xv44586's example file does. Thus, finding duplicates explicitly will help more often in this regard than checking the ActualText.

Pablo Andretta Jaskowiak · Answer 20 · Tue Sep 29 2020 20:42:41 GMT+0800 (China Standard Time)

> @pajaskowiak Can you share a PDF that demonstrates the issue?

I'm really sorry but I can't. It contains sensitive information.

Pablo Andretta Jaskowiak · Answer 21 · Tue Sep 29 2020 20:44:33 GMT+0800 (China Standard Time)

> Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates (perhaps those with same text, fontname, and size and within a few x/y points of the "original" character).

I did something similar to this. Anyway, I was able to fix the duplicates in my own code. Having the text from the PDF, even with occasional duplicates, is a big help already! Thank you for the project!

Jeremy Singer-Vine · Answer 22 · Sun Oct 04 2020 00:17:28 GMT+0800 (China Standard Time)

Commit 04fd56a (available in develop and in the next release) provides a Page.dedupe_chars(...) method that should address this general type of character duplication. (Thanks to @xv44586 for the PDF and test!) I'm closing this issue for now, but if anyone encounters character-duplication issues that the new method does not solve, feel free to comment on this thread. Priority will be given to comments containing a specific PDF and code that demonstrate the problem.
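
For readers landing here, a minimal usage sketch of the new method. It assumes a pdfplumber release that includes the commit above and that dedupe_chars accepts a tolerance argument and returns a filtered page object; the helper name, its defaults, and the file name in the usage comment are hypothetical:

```python
def extract_deduped_text(pdf_path, page_number=0, tolerance=1):
    """Return one page's text after removing near-duplicate characters."""
    # Imported here so the helper can be defined even where pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # dedupe_chars leaves the original page untouched and returns a
        # deduplicated version that the usual extract_* methods work on.
        deduped_page = page.dedupe_chars(tolerance=tolerance)
        return deduped_page.extract_text()


# Example call, assuming a local copy of an affected file:
# print(extract_deduped_text("repeat.pdf"))
```

If duplicates remain, raising the tolerance slightly may help, at the risk of merging characters that are genuinely distinct.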