BUG KeyError when trying to use `SimpleFindReplace.sub`
guwidoe opened this issue · comments
Describe the bug
I'm using the following code to try to find and replace a short string (a phone number) in a bunch of pdf files:
#...
# check whether we actually read a PDF
assert doc is not None
# find/replace
doc = SimpleFindReplace.sub(re.escape("(+34) 902 431250"),
                            re.escape("1234 1234 1234"), doc)
#...
and I get a KeyError:
  File "...\Python\Python310\site-packages\borb\toolkit\text\simple_find_replace.py", line 77, in sub
    page.apply_redact_annotations()
  File "...\Python\Python310\site-packages\borb\pdf\page\page.py", line 145, in apply_redact_annotations
    for x in self["Annots"]
KeyError: 'Annots'
To Reproduce
Steps to reproduce the behaviour:
import os
import re
import typing

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace


def process_pdf_file(input_file: str, output_file: str):
    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open(input_file, "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub(re.escape("(+34) 902 431250"),
                                re.escape("1234 1234 1234"), doc)

    # store
    with open(output_file, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


def reproduce_bug():
    input_file = r"...full_in_file_path"
    output_file = r"...full_out_file_path"
    # run the find/replace on the input file
    process_pdf_file(input_file, output_file)
File to reproduce: https://www.articsa.net/francais/productes/fitxes/07132fs.pdf
Expected behaviour
I expect the occurrence of "(+34) 902 431250" in the file to be replaced with "1234 1234 1234" in the output file.
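A side note on the two re.escape calls in the snippet above. The sketch below uses only the standard re module to show why escaping the search string is needed and what escaping the replacement does under plain re.sub. Whether SimpleFindReplace.sub treats its replacement argument the same way as re.sub is not stated in this thread, so that part is an assumption; the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, only for illustration.
text = "Tel: (+34) 902 431250"

# The search string contains regex metacharacters such as "(" and "+",
# so it has to be escaped to be matched literally.
pattern = re.escape("(+34) 902 431250")

# A plain replacement string is inserted as written.
good = re.sub(pattern, "1234 1234 1234", text)

# re.escape() also escapes spaces, and re.sub leaves the unknown escape "\ "
# in a replacement string alone, so the backslashes end up in the output.
escaped_replacement = re.escape("1234 1234 1234")
bad = re.sub(pattern, escaped_replacement, text)

print(good)
print(bad)
```

If the replacement is handled like a re.sub replacement string, passing it unescaped would be the safer choice.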
Desktop:
- OS: Windows 11
- borb version: borb-2.1.10-py3-none-any.whl
- input PDF: I cannot share the file, but it was created with Adobe Acrobat Pro DC by editing another PDF. I will try to create an example file which I can share.
PDF/A Validation Results
Validation results from https://demo.verapdf.org/ are shown below:
File: /tmp/cache7520751411204317480
Validation Profile: PDF/A-1B validation profile
Compliance: Failed
Statistics
Version:
Parser: GreenField
Build Date:
Processing time: 00:00:00.028
Total rules in Profile: 101
Passed Checks: 7415
Failed Checks: 100
Validation information

Specification: ISO 19005-1:2005, Clause: 6.1.7, Test number: 2
Rule: The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker
Status: Failed (12 occurrences)

Specification: ISO 19005-1:2005, Clause: 6.3.4, Test number: 1
Rule: The font programs for all fonts used within a conforming file shall be embedded within that file, as defined in PDF Reference 5.8, except when the fonts are used exclusively with text rendering mode 3
Status: Failed (4 occurrences)

Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 4
Rule: If an uncalibrated colour space is used in a file then that file shall contain a PDF/A-1 OutputIntent, as defined in 6.2.2
Status: Failed (32 occurrences)

Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 2
Rule: DeviceRGB may be used only if the file has a PDF/A-1 OutputIntent that uses an RGB colour space
Status: Failed (50 occurrences)

Specification: ISO 19005-1:2005, Clause: 6.1.8, Test number: 1
Rule: The object number and generation number shall be separated by a single white-space character. The generation number and obj keyword shall be separated by a single white-space character. The object number and endobj keyword shall each be preceded by an EOL marker. The obj and endobj keywords shall each be followed by an EOL marker.
Status: Failed (2 occurrences)
I tried fixing the error myself by adapting the library code. I looked for the apply_redact_annotations method in page.py (the Page class) and edited the following snippet:
rectangles_to_redact: typing.List[Rectangle] = [
    Rectangle(
        x["Rect"][0],
        x["Rect"][1],
        x["Rect"][2] - x["Rect"][0],
        x["Rect"][3] - x["Rect"][1],
    )
    for x in self["Annots"]
    if "Subtype" in x and x["Subtype"] == "Redact" and "Rect" in x
]
like this:
rectangles_to_redact: typing.List[Rectangle] = [
    Rectangle(
        x["Rect"][0],
        x["Rect"][1],
        x["Rect"][2] - x["Rect"][0],
        x["Rect"][3] - x["Rect"][1],
    )
    for x in (self["Annots"] if "Annots" in self else [])
    if "Subtype" in x and x["Subtype"] == "Redact" and "Rect" in x
]
checking if the key exists, thus avoiding the KeyError.
This fixes this particular error and the code runs without any further errors.
Unfortunately, the formatting of the output file is jumbled. That might have to do with PDF/A non-compliance.
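The guard described above can also be written with dict.get. The following is a minimal, self-contained sketch of that pattern, not borb's actual code: FakePage and redact_rectangles are hypothetical stand-ins, plain tuples replace borb's Rectangle, and the only assumption carried over from the traceback is that a page behaves like a dictionary which may lack an "Annots" entry.

```python
import typing


class FakePage(dict):
    """Hypothetical stand-in for a dictionary-like page object."""

    def redact_rectangles(
        self,
    ) -> typing.List[typing.Tuple[float, float, float, float]]:
        # .get() returns an empty list when the page has no "Annots" entry,
        # so the comprehension yields nothing instead of raising KeyError.
        return [
            (
                x["Rect"][0],
                x["Rect"][1],
                x["Rect"][2] - x["Rect"][0],  # width
                x["Rect"][3] - x["Rect"][1],  # height
            )
            for x in self.get("Annots", [])
            if x.get("Subtype") == "Redact" and "Rect" in x
        ]


page_without_annots = FakePage()
page_with_redaction = FakePage(
    Annots=[
        {"Subtype": "Redact", "Rect": [10, 20, 110, 50]},
        {"Subtype": "Link", "Rect": [0, 0, 5, 5]},
    ]
)

print(page_without_annots.redact_rectangles())  # []
print(page_with_redaction.redact_rectangles())  # [(10, 20, 100, 30)]
```

Both forms behave the same for a missing key; .get just avoids looking the key up twice.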
Hi there,
The KeyError should be fixed in the next release.
However, your PDF seems to have quite a few problems that would make it hard for a strict parser (e.g. borb) to process your document.
While I understand that you may not have control over the PDF documents you are processing (they may be documents you yourself have received from an external party), I am sure you can appreciate that it is easier for the development of borb if we set some guidelines.
Kind regards,
Joris Schellekens
Thanks for the quick reply.
Indeed, I have no control over the (malformed) PDF documents I'm working with, and I understand that borb seems to be the wrong tool in this case. Thank you nonetheless for providing it!
I have since tried some other solutions and found one that works excellently and leaves the rest of the document unaltered: https://kb.aspose.com/pdf/python/how-to-find-and-replace-text-in-pdf-using-python/
However, it involves the paid library Aspose.PDF. As an expert on this matter, are you aware of another tool (Python library) that could be used for this job, or do you think this paid library is my only realistic option?
Hi @guwidoe
I'm afraid I don't keep track of what other libraries provide or what they are capable of.
Kind regards,
Joris Schellekens