A Rectangle must have a non-negative width (from RegEx text detection)

Question

DrPlanecraft opened this issue 8 months ago · comments

Hello Again!

I have an issue similar to my previous report. This time, however, the failure is in RegularExpressionTextExtraction; the same document passes through SimpleLineOfTextExtraction without error.

To reproduce, run the following code without the try/except:

from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('–', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('system', ('that', 'live', 'in', 'the', 'digestive', 'system')), ('Shirota-', ('root', 'of', 'the', 'business', 'activities.', 'In', 'addition', 'to', 'these', 'core', 'ideas,', 'Shirota-')), ('ism', ('ism', 'also', 'encompasses', 'the', 'virtues', 'of', 'sincerity,', 'care', 'for', 'the', 'community,')), ('price', ('A', 'price', 'anyone')), ('can', ('can', 'afford')), ('afford', ('can', 'afford')), ('–', ('–', 'were', 'able', 'to', 'inhibit', 'the', 'growth')), ('6', ('6',)), ('7', ('7',)), ('L.', ('L.', 'casei', 'strain', 'Shirota')), ('exclusive', ('is', 'exclusive', 'only', 'to', 'Yakult')), ('discovered', ('discovered', 'by', 'our')), ('exclusive', ('Shirota.', 'It', 'is', 'exclusive')), ('cannot', ('cannot', 'be', 'found', 'in')), ('found', ('cannot', 'be', 'found', 'in')), ('any', ('any', 'other', 'cultured')), ('other', ('any', 'other', 'cultured')), ('drinks.', ('milk', 'drinks.')), ('–', ('–',)), ('Intestinal', ('A', 'Healthy', 'Intestinal')), ('Tract,', ('Tract,', 'Healthy', 'Life')), ('Life', ('Tract,', 'Healthy', 'Life')), ('Masses', ('the', 'Masses')), ('F', ('F',)), ('I', ('I',)), ('R', ('R',)), ('S', ('S',)), ('T', ('T',)), ('P', ('P',)), ('R', ('R',)), ('O', ('O',)), ('D', ('D',)), ('U', ('U',)), ('C', ('C',)), ('E', ('E',)), ('D', ('D',)), ('I', ('I',)), ('N', ('N',)), ('1', ('1',)), ('9', ('9',)), ('3', ('3',)), ('5', ('5',))]
with open("Artwork 2.pdf","rb") as wkwk:
    artwork = PDF.loads(file=wkwk,event_listeners=[])

print(artwork.get_page(0))
print("\nArtwork:\n")
for word, sentence in missedMatches: # Artwork Matches missed
    sentence = " ".join(sentence).strip().replace("'","’").replace('-', "–")
    print(sentence)

    try:
        # This is the call that raises in the traceback below.
        extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)
    except AssertionError:
        # Nothing was extracted for this sentence, so there are no bounding boxes to print.
        print("triggered")
        continue

    print(extractedSentence)
    print("\n")

Traceback (most recent call last):
  File "C:\Users\Lenovo\OneDrive\Documents\LI ZHUOXI\ITE- College West\Lessons\Industrial Attachment Program\IAP Higher Nitec AI Applications\HumanKind Design Pte Ltd\AI_Proofreading\operations.py", line 185, in findOnPDF
    extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, self.artwork)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 371, in get_matches_for_pdf
    CanvasStreamProcessor(page, Canvas(), []).read(page_source, [cse])
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 305, in read
    raise e
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 299, in read
    operator.invoke(self, operands, event_listeners)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\operator\text\show_text.py", line 49, in invoke
    l._event_occurred(tri)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 322, in _event_occurred
    self._render_text(event)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 334, in _render_text
    for e in text_render_info.split_on_glyphs():
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\event\chunk_of_text_render_event.py", line 172, in split_on_glyphs
    e._baseline_bounding_box = Rectangle(
                               ^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\geometry\rectangle.py", line 29, in __init__
    assert width >= 0, "A Rectangle must have a non-negative width."
           ^^^^^^^^^^
AssertionError: A Rectangle must have a non-negative width.
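
For readers less familiar with this kind of failure: an assertion like the one at the bottom of the traceback fires whenever a caller computes a width as "right edge minus left edge" and the two edges arrive in the opposite order. The sketch below is a self-contained stand-in, not borb's code. `Rect` and `rect_from_edges` are hypothetical names, and nothing here claims this is what actually happens inside split_on_glyphs.

```python
class Rect:
    """Hypothetical stand-in that mirrors the check shown in the traceback."""

    def __init__(self, x: float, y: float, width: float, height: float):
        assert width >= 0, "A Rectangle must have a non-negative width."
        self.x, self.y, self.width, self.height = x, y, width, height


def rect_from_edges(x0: float, x1: float, y: float, height: float) -> Rect:
    """Build a Rect from two x-coordinates, whichever order they arrive in."""
    left, right = min(x0, x1), max(x0, x1)
    return Rect(left, y, right - left, height)


# Edges in "reversed" order: a naive right-minus-left width would be negative.
try:
    Rect(10, 0, 8 - 10, 5)
except AssertionError as error:
    print("assertion fired:", error)

# Normalising the edges first avoids the negative width.
box = rect_from_edges(10, 8, 0, 5)
print(box.x, box.width)
```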

Expected behaviour
I want to get the locations of all regex matches so that I can draw boxes on the PDF itself.
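
A side note, independent of the assertion itself: the joined sentence is passed as a regular expression, so a word such as 'L.' is read as "L followed by any character". The sketch below uses only the standard library `re` module on a plain string to show the difference that escaping makes. `literal_pattern` is a hypothetical helper, and how borb handles whitespace between glyphs is not something this example assumes.

```python
import re


def literal_pattern(words):
    """Join words into a regex that matches them literally, separated by whitespace."""
    return r"\s+".join(re.escape(word) for word in words)


text = "L. casei strain Shirota and Lx casei strain Shirota"
words = ("L.", "casei")

# Unescaped: the "." matches any character, so "Lx casei" is matched too.
naive_spans = [m.span() for m in re.finditer(" ".join(words), text)]

# Escaped: only the literal "L. casei" is matched.
literal_spans = [m.span() for m in re.finditer(literal_pattern(words), text)]

print(naive_spans)
print(literal_spans)
```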

Desktop (please complete the following information):

OS: Windows 11
borb version 2.1.19.2
Artwork 2.pdf

Additional context
Edit: replaced the linked document with a mostly valid document

Joris Schellekens · Answer 1 · Wed Nov 15 2023 18:58:28 GMT+0800 (China Standard Time)

Can you please make your example as minimal as possible? Rather than attempting to match everything in the list for instance, you could limit your example to the first failing match.

Kind regards,
Joris Schellekens
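
One way to narrow a long list down to the first failing entry is sketched below. It is a minimal illustration, not part of borb. `first_failing` and `fake_run` are hypothetical names. In the real script, `run` would wrap the call from the traceback, for example `lambda p: RegularExpressionTextExtraction.get_matches_for_pdf(p, artwork)`.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_failing(
    patterns: Iterable[str], run: Callable[[str], object]
) -> Optional[Tuple[int, str]]:
    """Return (index, pattern) for the first pattern whose run raises AssertionError."""
    for index, pattern in enumerate(patterns):
        try:
            run(pattern)
        except AssertionError:
            return index, pattern
    return None


def fake_run(pattern: str) -> None:
    """Stand-in for the real extraction call; fails on patterns containing 'bad'."""
    assert "bad" not in pattern, "simulated failure"


print(first_failing(["that live in", "a bad one", "can afford"], fake_run))
```

The pattern it returns is the only list entry a minimal report needs to keep.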

DrPlanecraft · Answer 2 · Wed Nov 15 2023 19:42:31 GMT+0800 (China Standard Time)

Do I need to cut down the PDF as well?

If not, here is the updated code:

from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria'))]
with open("Artwork 2.pdf","rb") as file:
    artwork = PDF.loads(file=file)

for word, sentence in missedMatches: # Artwork Matches missed
    sentence = " ".join(sentence).strip().replace("'","’").replace('-', "–")
    print(sentence)

    extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)
        
    print(extractedSentence)
    print("\n")

I apologise for any syntax or formatting errors, as I am writing this reply on a mobile phone.

DrPlanecraft · Answer 3 · Thu Nov 16 2023 11:21:45 GMT+0800 (China Standard Time)

@jorisschellekens, I have edited the main post to update the linked PDF. I realised that the previous PDF was 0 bytes, which produced a separate error.

Joris Schellekens · Answer 4 · Tue Nov 28 2023 03:54:03 GMT+0800 (China Standard Time)

In the latest version of borb this does not throw an error: