BUG: ImageExtraction not extracting all the images in pdf

luojunhui1 opened this issue a year ago

Describe the bug
ImageExtraction does not extract all the images in a PDF.

To Reproduce

I have a PDF file with 9 pages. Pages 6, 7 and 8 (page numbers start at 0) each contain one image.
ImageExtraction only detects the image on page 7 and ignores the images on pages 6 and 8.

# read the Document
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()

with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])

# check whether we have read a Document
assert doc is not None

images = []

# print the XObject names declared in each page's resources
for page in range(0, 9):
    if "XObject" in doc.get_page(page)["Resources"]:
        for k, v in doc.get_page(page)["Resources"]["XObject"].items():
            print("%d\t%s" % (page, k))

# print the pages for which ImageExtraction reported images
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")

Expected behaviour
The ImageExtraction listener should return all the images.

Desktop:

OS: Windows 10
borb version: 2.1.10

Joris Schellekens · Answer 1 · Mon May 01 2023 03:53:49 GMT+0800 (China Standard Time)

Please attach the input PDF

luojunhui · Answer 2 · Mon May 01 2023 15:20:54 GMT+0800 (China Standard Time)

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below:

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")

the test output screenshot is

input_doc2.pdf

Joris Schellekens · Answer 3 · Mon May 01 2023 16:36:05 GMT+0800 (China Standard Time)

I checked the images in your PDF.
It turns out borb does not support them yet.
That is why they are not extracted.
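
To see which images are affected, it can help to print the /Subtype and /Filter of every XObject, not only its name. The following is a minimal sketch, assuming a page's resources behave like plain Python dictionaries keyed by strings (which the reproduction code in this thread already relies on). `describe_xobjects` is a hypothetical helper, not part of borb, and the example values are made up.

```python
from typing import Any, List, Mapping, Tuple


def describe_xobjects(resources: Mapping[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, subtype, filters) for every XObject in a page's resources."""
    rows = []
    for name, xobject in resources.get("XObject", {}).items():
        subtype = str(xobject.get("Subtype", "?"))
        filters = xobject.get("Filter", [])
        # /Filter may be a single name or an array of names
        if not isinstance(filters, (list, tuple)):
            filters = [filters]
        filter_text = ", ".join(str(f) for f in filters) or "none"
        rows.append((str(name), subtype, filter_text))
    return rows


# Stand-in for one page's /Resources dictionary (hypothetical values)
example_resources = {
    "XObject": {
        "Im0": {"Subtype": "Image", "Filter": "DCTDecode"},
        "Im1": {"Subtype": "Image", "Filter": ["FlateDecode", "DCTDecode"]},
        "Fm0": {"Subtype": "Form"},
    }
}

for row in describe_xobjects(example_resources):
    print("%s\t%s\t%s" % row)
```

With a loaded document, the same helper could be called as describe_xobjects(doc.get_page(page)["Resources"]). Whether borb's name objects convert cleanly with str() is an assumption here. Comparing the rows for the pages that are extracted with the rows for the pages that are skipped should show which subtype or filter combination is the unsupported one.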

luojunhui · Answer 4 · Mon May 01 2023 23:28:02 GMT+0800 (China Standard Time)

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

Joris Schellekens · Answer 5 · Wed May 03 2023 00:49:39 GMT+0800 (China Standard Time)

You would have to implement your own version of an ImageTransformer (see the io and read packages).

Essentially you need to:

- identify when this transformer needs to be triggered
- decide what this transformer needs to do to convert the raw bytes to a PIL Image
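
The two steps above can be sketched independently of borb. This is not borb's transformer interface: the methods a real subclass has to override are not shown in this thread, so `is_raw_flate_image`, `pil_mode` and `to_pil_image` are hypothetical helpers. The sketch also assumes the simplest possible case: a FlateDecode image with 8 bits per component, a DeviceGray or DeviceRGB colour space, and no /DecodeParms predictor. The images in the attached PDF may well use a different encoding.

```python
import zlib
from typing import Any, Mapping, Optional

# PIL modes for the colour spaces this sketch handles (8 bits per component only)
_MODES = {"DeviceGray": "L", "DeviceRGB": "RGB"}


def pil_mode(stream_dict: Mapping[str, Any]) -> Optional[str]:
    """Return the PIL mode for this image dictionary, or None if unsupported."""
    if stream_dict.get("BitsPerComponent") != 8:
        return None
    return _MODES.get(str(stream_dict.get("ColorSpace")))


def is_raw_flate_image(stream_dict: Mapping[str, Any]) -> bool:
    """Step 1: decide whether this object is one the converter should handle."""
    return (
        str(stream_dict.get("Subtype")) == "Image"
        and str(stream_dict.get("Filter")) == "FlateDecode"
        and "DecodeParms" not in stream_dict  # predictors are not handled here
        and pil_mode(stream_dict) is not None
    )


def to_pil_image(stream_dict: Mapping[str, Any], raw_bytes: bytes):
    """Step 2: turn the stream's raw (still compressed) bytes into a PIL Image."""
    # Requires Pillow; imported here so that step 1 works without it.
    from PIL import Image

    pixels = zlib.decompress(raw_bytes)
    size = (int(stream_dict["Width"]), int(stream_dict["Height"]))
    return Image.frombytes(pil_mode(stream_dict), size, pixels)
```

A caller would check is_raw_flate_image(stream_dict) first and only then call to_pil_image(stream_dict, raw_bytes). Keeping the check separate from the conversion mirrors the two steps listed above and makes it easy to add further cases (other filters, other colour spaces) one at a time.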