pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

Home Page: https://pdfminersix.readthedocs.io

Parse Image on the fly

FrsECM opened this issue · comments

Feature request
Hello!
I'm using pdfminer to parse files that will be used in a RAG pipeline.

Context

To do that, I would like to do two things:

  • Image captioning on the fly
  • Table Parsing on the fly

Currently, these two kinds of content are hard to process on the fly.

Images

To process images, I would like to create a PIL image on the fly from an LTImage.
Currently, the ImageWriter class takes an output folder as input and only writes files to disk.
=> It would be great to have another class, ImageWriterPIL, that generates the PIL image in memory.
ImageWriter would then just extend this behaviour to save the PIL image to disk, nothing more.
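
Until something like ImageWriterPIL exists, one possible workaround is to let the existing ImageWriter write into a temporary directory and reload the result with PIL. This is only a minimal sketch, assuming Pillow is installed. The helper names iter_lt_images and lt_image_to_pil are hypothetical and not part of pdfminer, and spotting images by their `stream` attribute is a duck-typing shortcut (an isinstance check against LTImage would be stricter). Some images are exported in formats PIL may not be able to open, so the second helper can raise.

```python
import os
import tempfile


def iter_lt_images(layout_obj):
    """Recursively yield image leaves from a pdfminer layout tree.

    Assumption: only LTImage objects carry a `stream` attribute, and
    containers such as LTPage or LTFigure are iterable over their children.
    """
    if hasattr(layout_obj, "stream"):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_lt_images(child)


def lt_image_to_pil(lt_image):
    """Return an in-memory PIL image for one LTImage (hypothetical helper)."""
    # Imported lazily so the tree walker above works without these packages.
    from pdfminer.image import ImageWriter
    from PIL import Image

    with tempfile.TemporaryDirectory() as tmpdir:
        # export_image writes the file into tmpdir and returns its file name.
        name = ImageWriter(tmpdir).export_image(lt_image)
        with Image.open(os.path.join(tmpdir, name)) as img:
            # copy() forces the pixel data into memory before the
            # temporary file is deleted.
            return img.copy()


# Possible usage ("doc.pdf" is a placeholder path):
#
#   from pdfminer.high_level import extract_pages
#   for page in extract_pages("doc.pdf"):
#       for lt_image in iter_lt_images(page):
#           pil_image = lt_image_to_pil(lt_image)
```

The round trip through the file system is exactly the overhead this request would like to avoid, so the sketch is a stopgap rather than a substitute for the proposed class.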

Tables

Tables are sometimes challenging to detect and parse, especially when there are multiple tables on the same page.
The fix is not as easy as the one for images.

For example this file :
Mathematical Foundations of Image Processing and Analysis 2 - 2014 - Pinoli - Table of Acronyms.pdf

It has multiple tables on the same page, and the result is a lot of LTRect/LTTextBox objects that are complicated to make sense of.
It would be great if we could (optionally) have a layout component "LTTable" to handle that.
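
To illustrate what such a component might have to do, here is a rough sketch of one possible heuristic, not an existing pdfminer feature: cluster rectangles that touch each other into candidate table regions, then attach each text box to the region that contains its centre. The function names and the tolerance value are assumptions made for this example. It works on plain (x0, y0, x1, y1) tuples, which is the shape of the `bbox` attribute on layout objects.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def boxes_touch(a: BBox, b: BBox, tol: float = 2.0) -> bool:
    """True if two boxes overlap or lie within `tol` points of each other."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol
            and a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def group_rects(rects: Sequence[BBox], tol: float = 2.0) -> List[BBox]:
    """Cluster touching rectangles; return one bounding box per cluster."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if boxes_touch(rects[i], rects[j], tol):
                parent[find(i)] = find(j)

    clusters: Dict[int, List[BBox]] = {}
    for i, rect in enumerate(rects):
        clusters.setdefault(find(i), []).append(rect)
    return [
        (min(r[0] for r in c), min(r[1] for r in c),
         max(r[2] for r in c), max(r[3] for r in c))
        for c in clusters.values()
    ]


def assign_text(tables: Sequence[BBox],
                text_boxes: Sequence[BBox]) -> List[List[int]]:
    """For each table region, list the indices of text boxes centred in it."""
    result: List[List[int]] = [[] for _ in tables]
    for idx, (x0, y0, x1, y1) in enumerate(text_boxes):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for t, (tx0, ty0, tx1, ty1) in enumerate(tables):
            if tx0 <= cx <= tx1 and ty0 <= cy <= ty1:
                result[t].append(idx)
                break
    return result
```

With pdfminer, the inputs could be collected from a page as `[o.bbox for o in page if isinstance(o, LTRect)]` and the equivalent for LTTextBox. Tables drawn with lines or curves instead of rectangles, or with no ruling at all, would need a different heuristic.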
We can do this by :

All of this could be done inside the library or outside it, but at a minimum we need a way to:

  • Render a page as a PIL image.

This feature is available in the pdfreader library, but I prefer pdfminer for a lot of other things.