jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.

Home Page: https://borbpdf.com/

BUG KeyError when trying to use `SimpleFindReplace.sub`

guwidoe opened this issue

Describe the bug
I'm using the following code to try to find and replace a short string (a phone number) in a bunch of pdf files:

#...

# check whether we actually read a PDF
assert doc is not None

# find/replace
doc = SimpleFindReplace.sub(re.escape("(+34) 902 431250"), 
                            re.escape("1234 1234 1234"), doc)
#...

and I get a KeyError:

  File "...\Python\Python310\site-packages\borb\toolkit\text\simple_find_replace.py", line 77, in sub
    page.apply_redact_annotations()
  File "...\Python\Python310\site-packages\borb\pdf\page\page.py", line 145, in apply_redact_annotations
    for x in self["Annots"]
KeyError: 'Annots'

To Reproduce
Steps to reproduce the behaviour:

import os
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace
import typing
import re


def process_pdf_file(input_file: str, output_file: str):
    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open(input_file, "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub(re.escape("(+34) 902 431250"), 
                                re.escape("1234 1234 1234"), doc)

    # store
    with open(output_file, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

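def reproduce_bug():
    input_file = r"...full_in_file_path"
    output_file = r"...full_out_file_path"
    # run the find/replace on the input file; this call raises the KeyError
    process_pdf_file(input_file, output_file)


if __name__ == "__main__":
    reproduce_bug()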

File to reproduce: https://www.articsa.net/francais/productes/fitxes/07132fs.pdf

Expected behaviour
I expect the occurrence of "(+34) 902 431250" in the file to be replaced with "1234 1234 1234" in the output file.
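
A side note on the `re.escape` calls in the snippet: the sketch below uses only Python's standard `re` module, not borb, to show what escaping does to the search pattern and to the replacement text. Whether `SimpleFindReplace.sub` treats its second argument as a regex replacement template is an assumption that this thread does not confirm; the names `text`, `pattern`, `plain` and `escaped` are illustrative.

```python
import re

# Illustrative text containing the phone number from the report.
text = "Tel: (+34) 902 431250"

# Escaping the *pattern* is needed: "(", "+" and ")" are regex metacharacters.
pattern = re.escape("(+34) 902 431250")

# A plain replacement string is inserted as-is.
plain = re.sub(pattern, "1234 1234 1234", text)

# re.escape also escapes spaces, and re.sub leaves the unknown escape "\ "
# untouched, so the backslashes end up in the output.
escaped = re.sub(pattern, re.escape("1234 1234 1234"), text)

print(plain)    # Tel: 1234 1234 1234
print(escaped)  # Tel: 1234\ 1234\ 1234
```

With plain `re.sub`, escaping the replacement therefore changes the result. If borb passes the replacement through literally, the same backslashes would presumably appear in the output text as well.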

Desktop:

  • OS: Windows 11
  • borb version: borb-2.1.10-py3-none-any.whl
  • input PDF: I cannot share the file, but it was created with Adobe Acrobat Pro DC by editing another PDF. I will try to create an example file which I can share.

PDF/A Validation Results
Validation results from https://demo.verapdf.org/ are shown below:

File: /tmp/cache7520751411204317480
Validation Profile: PDF/A-1B validation profile
Compliance: Failed

Statistics
Version:
Parser: GreenField
Build Date:
Processing time: 00:00:00.028
Total rules in Profile: 101
Passed Checks: 7415
Failed Checks: 100

Validation information

  • Specification: ISO 19005-1:2005, Clause: 6.1.7, Test number: 2. Status: Failed (12 occurrences)
    The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker.

  • Specification: ISO 19005-1:2005, Clause: 6.3.4, Test number: 1. Status: Failed (4 occurrences)
    The font programs for all fonts used within a conforming file shall be embedded within that file, as defined in PDF Reference 5.8, except when the fonts are used exclusively with text rendering mode 3.

  • Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 4. Status: Failed (32 occurrences)
    If an uncalibrated colour space is used in a file then that file shall contain a PDF/A-1 OutputIntent, as defined in 6.2.2.

  • Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 2. Status: Failed (50 occurrences)
    DeviceRGB may be used only if the file has a PDF/A-1 OutputIntent that uses an RGB colour space.

  • Specification: ISO 19005-1:2005, Clause: 6.1.8, Test number: 1. Status: Failed (2 occurrences)
    The object number and generation number shall be separated by a single white-space character. The generation number and obj keyword shall be separated by a single white-space character. The object number and endobj keyword shall each be preceded by an EOL marker. The obj and endobj keywords shall each be followed by an EOL marker.


I tried fixing the error myself by adapting the library code. I looked for the apply_redact_annotations method in page.py (in the Page class) and edited the following snippet:

        rectangles_to_redact: typing.List[Rectangle] = [
            Rectangle(
                x["Rect"][0],
                x["Rect"][1],
                x["Rect"][2] - x["Rect"][0],
                x["Rect"][3] - x["Rect"][1],
            )
            for x in self["Annots"]
            if "Subtype" in x and x["Subtype"] == "Redact" and "Rect" in x
        ]

like this:

        rectangles_to_redact: typing.List[Rectangle] = [
            Rectangle(
                x["Rect"][0],
                x["Rect"][1],
                x["Rect"][2] - x["Rect"][0],
                x["Rect"][3] - x["Rect"][1],
            )
            for x in (self["Annots"] if "Annots" in self else [])
            if "Subtype" in x and x["Subtype"] == "Redact" and "Rect" in x
        ]

checking if the key exists, thus avoiding the KeyError.

This fixes this particular error and the code runs without any further errors.
Unfortunately, the formatting of the output file is jumbled. That might have to do with PDF/A non-compliance.
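
The guard described above can also be written with `dict.get`. The following is a minimal sketch that runs without borb: it assumes the page object behaves like a plain `dict`, and `redact_rectangles`, `Rect` and the sample pages are hypothetical stand-ins, not part of borb's API.

```python
import typing

# Hypothetical stand-in for borb's Rectangle: (x, y, width, height).
Rect = typing.Tuple[float, float, float, float]


def redact_rectangles(page: dict) -> typing.List[Rect]:
    """Collect the rectangles of all redact annotations on a dict-like page."""
    return [
        (
            x["Rect"][0],
            x["Rect"][1],
            x["Rect"][2] - x["Rect"][0],  # width
            x["Rect"][3] - x["Rect"][1],  # height
        )
        # .get() returns an empty list when the page has no "Annots" entry,
        # so the comprehension simply yields nothing instead of raising.
        for x in page.get("Annots", [])
        if x.get("Subtype") == "Redact" and "Rect" in x
    ]


# A page without annotations, like the one that triggered the KeyError.
page_without_annots: dict = {}

# A page with one redact annotation and one unrelated annotation.
page_with_redact = {
    "Annots": [
        {"Subtype": "Redact", "Rect": [10, 20, 110, 50]},
        {"Subtype": "Link", "Rect": [0, 0, 5, 5]},
    ]
}

print(redact_rectangles(page_without_annots))  # []
print(redact_rectangles(page_with_redact))     # [(10, 20, 100, 30)]
```

Only the missing-key case is covered here; the jumbled output formatting is a separate problem that this guard does not address.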

Hi there,

The KeyError should be fixed in the next release.

However, your PDF seems to have quite a few problems that would make it hard for a strict parser (e.g. borb) to process your document.

While I understand that you may not have control over the PDF documents you are processing (they may be documents you yourself have received from an external party), I am sure you can appreciate that it is easier for the development of borb if we set some guidelines.

Kind regards,
Joris Schellekens

Hi @jorisschellekens,

Thanks for the quick reply.

Indeed, I have no control over the (malformed) PDF documents I'm working with, and I understand that borb seems to be the wrong tool in this case. Thank you nonetheless for providing it!

I have since tried some other solutions and found that this one seems to work excellently, leaving the rest of the document unaltered: https://kb.aspose.com/pdf/python/how-to-find-and-replace-text-in-pdf-using-python/

However, it involves the paid library Aspose.PDF. As an expert on this matter, are you aware of another tool (python library) that could be used for this job, or do you think this paid library is my only realistic option?

Hi @guwidoe

I'm afraid I don't keep track of what other libraries provide or what they are capable of.

Kind regards,
Joris Schellekens