PDF Scanned Files Handling

Question

PascalSun opened this issue a month ago

Is your feature request related to a problem? Please describe.
Some of the PDFs in our data are actually scanned documents, which are hard to handle.
We need to use state-of-the-art models and methods to get these files into the KG.

Describe the solution you'd like

OCR for text detection works well, so we can start by generating the text.
Detecting images and tables in a scanned PDF is still challenging.
- So we are not going to extract them for now; we will hook the page image directly to the page.
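
A minimal sketch of that plan, assuming each scanned page has already been rendered to an image file. The names `ScannedPage` and `build_scanned_pages` are hypothetical and not part of the existing codebase, and the OCR engine is passed in as a plain callable so that any engine can be plugged in:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScannedPage:
    """One page of a scanned PDF: OCR text plus a link to the page image."""

    page_number: int
    text: str
    image_path: str  # the whole page image, hooked to the page as-is


def build_scanned_pages(
    image_paths: Iterable[str],
    ocr: Callable[[str], str],
) -> List[ScannedPage]:
    """Run OCR on every page image and attach the image itself to the page.

    `ocr` is an assumption: any function that maps an image path to text.
    No figure or table extraction is attempted here.
    """
    pages = []
    for number, path in enumerate(image_paths, start=1):
        pages.append(
            ScannedPage(
                page_number=number,
                text=ocr(path).strip(),
                image_path=str(path),
            )
        )
    return pages
```

With Tesseract installed, one possible `ocr` argument is `lambda p: pytesseract.image_to_string(Image.open(p))`; that choice of engine is only an example, not something decided in this issue.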

Pascal Sun · Answer 1 · Fri May 31 2024 14:16:38 GMT+0800 (China Standard Time)

The layout KG generation process will need to be a bit different from the normal one for these files.
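
One way the two paths could diverge, as a rough sketch only. The function `build_layout_kg`, the node type names, and the dictionary keys (`text`, `image_path`, `figures`, `tables`) are all assumptions made for illustration, not the project's actual schema:

```python
from typing import Any, Dict, List


def build_layout_kg(
    title: str, pages: List[Dict[str, Any]], scanned: bool
) -> Dict[str, Any]:
    """Return a nested layout tree: document -> pages -> page children."""
    page_nodes = []
    for number, page in enumerate(pages, start=1):
        # Both paths keep the page text (OCR output for scanned files).
        children = [{"node_type": "text", "content": page.get("text", "")}]
        if scanned:
            # Scanned path: no figure/table extraction for now,
            # the whole page image is attached to the page node.
            children.append(
                {"node_type": "page_image", "path": page["image_path"]}
            )
        else:
            # Normal path: figures and tables are separate child nodes.
            for figure in page.get("figures", []):
                children.append({"node_type": "figure", "path": figure})
            for table in page.get("tables", []):
                children.append({"node_type": "table", "path": table})
        page_nodes.append(
            {"node_type": "page", "page_number": number, "children": children}
        )
    return {
        "node_type": "document",
        "title": title,
        "scanned": scanned,
        "children": page_nodes,
    }
```

Keeping a single tree shape for both kinds of file would let downstream steps treat scanned pages like any other page, with the page image standing in for the figures and tables that are not extracted yet.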