jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in Python.

Home Page: https://borbpdf.com/

BUG: ImageExtraction not extracting all the images in pdf

luojunhui1 opened this issue · comments

Describe the bug
ImageExtraction does not extract all the images in a PDF.

To Reproduce

  1. Take a PDF file with 9 pages that has one image on each of pages 6, 7 and 8 (page numbers start at 0).
  2. ImageExtraction only detects the image on page 7 and ignores the images on pages 6 and 8.
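    import typing

    # Import paths assumed for borb 2.1.x; adjust them if your version differs.
    from borb.pdf import Document, PDF
    from borb.toolkit import ImageExtraction, SimpleTextExtraction

    # read the Document, attaching both listeners
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []

    # list every XObject declared in each page's resources
    for page in range(0, 9):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    # list the pages for which ImageExtraction actually returned images
    for page, content in image_l.get_images().items():
        images.extend(content)
        print(f"image page: {page}")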

Expected behaviour
The ImageExtraction listener should return all the images.

Screenshots
image

Desktop (please complete the following information):

  • OS: Windows 10
  • borb version 2.1.10

Please attach the input PDF.

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still incorrect. The complete test code is below:

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    # list every XObject declared in each page's resources
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    # list the pages for which ImageExtraction actually returned images
    for page, content in image_l.get_images().items():
        images.extend(content)
        print(f"image page: {page}")

The test output screenshot is:
image

input_doc2.pdf

I checked the images in your PDF.
It turns out borb does not support them yet.
That's why they are not extracted.
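
One way to see which encodings are involved is to print the /Filter entry of every image XObject, reusing the dictionary access pattern from the test code above. This is only a sketch: `list_image_filters` is a hypothetical helper, not part of borb, and it assumes each XObject behaves like a dictionary with a `get` method and that its /Subtype and /Filter values convert to readable strings with `str()`.

```python
from typing import Any, List, Tuple


def list_image_filters(doc: Any, page_count: int) -> List[Tuple[int, str, str]]:
    """Return (page number, XObject name, filter) for every image XObject.

    `doc` only needs a get_page(i) method whose result supports
    ["Resources"]["XObject"], as in the test code from this issue.
    """
    found: List[Tuple[int, str, str]] = []
    for page_nr in range(page_count):
        resources = doc.get_page(page_nr)["Resources"]
        if "XObject" not in resources:
            continue
        for name, xobject in resources["XObject"].items():
            # Skip form XObjects and anything else that is not an image.
            if str(xobject.get("Subtype")) != "Image":
                continue
            found.append((page_nr, str(name), str(xobject.get("Filter"))))
    return found
```

Calling `list_image_filters(doc, page_num)` on the loaded document and comparing the result with the pages returned by `image_l.get_images()` should show which filter values belong to the images that were skipped.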

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

You would have to implement your own version of an ImageTransformer (see the io and read packages).

Essentially you need to:

  • identify when this transformer needs to be triggered
  • determine what this transformer needs to do to convert the raw bytes to a PIL Image
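
The two steps above can be sketched in plain Python, independent of borb's transformer classes. Everything here is an assumption made for illustration: the helper names (`should_handle`, `decode_pixels`, `to_pil_image`) are hypothetical, the image dictionary is treated as a plain `dict`, and the sketch only covers 8-bit /FlateDecode images without a /DecodeParms predictor, which may not be the encoding used in input_doc2.pdf. A real transformer would have to follow borb's own interface and the actual filter of the unsupported images.

```python
import zlib
from typing import Any, Dict

# Hypothetical helpers for illustration; none of these names come from borb.
# PDF /ColorSpace name -> (PIL mode, bytes per pixel), for 8-bit samples only.
_MODES = {"DeviceGray": ("L", 1), "DeviceRGB": ("RGB", 3), "DeviceCMYK": ("CMYK", 4)}


def should_handle(stream_dict: Dict[str, Any]) -> bool:
    """Step 1: trigger only for image XObjects this sketch can decode."""
    return (
        str(stream_dict.get("Subtype")) == "Image"
        and str(stream_dict.get("Filter")) == "FlateDecode"
        and stream_dict.get("BitsPerComponent") == 8
        and str(stream_dict.get("ColorSpace")) in _MODES
    )


def decode_pixels(stream_dict: Dict[str, Any], raw_bytes: bytes) -> bytes:
    """Step 2a: undo the Flate compression and check the sample count.

    Assumes no /DecodeParms predictor was applied to the data.
    """
    _, bytes_per_pixel = _MODES[str(stream_dict["ColorSpace"])]
    pixels = zlib.decompress(raw_bytes)
    expected = int(stream_dict["Width"]) * int(stream_dict["Height"]) * bytes_per_pixel
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes of samples, got {len(pixels)}")
    return pixels


def to_pil_image(stream_dict: Dict[str, Any], raw_bytes: bytes):
    """Step 2b: wrap the decoded samples in a PIL Image (requires Pillow)."""
    from PIL import Image  # imported lazily so steps 1 and 2a need only the stdlib

    mode, _ = _MODES[str(stream_dict["ColorSpace"])]
    size = (int(stream_dict["Width"]), int(stream_dict["Height"]))
    return Image.frombytes(mode, size, decode_pixels(stream_dict, raw_bytes))
```

Keeping the trigger check separate from the decoding makes it easier to add further encodings later: each new filter needs its own check and its own way of turning the raw bytes into pixel samples, while the final conversion to a PIL Image stays the same.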