PDF Scanned Files Handling
PascalSun opened this issue
Is your feature request related to a problem? Please describe.
Some of the PDFs are actually scanned documents, which are hard to handle.
We need to use state-of-the-art models or methods to get this content into the KG.
Describe the solution you'd like
- OCR for text detection works well enough, so the first step is to generate the text with OCR.
- Detecting images and tables in a scanned PDF is still challenging.
- So we will not extract them for now; instead, we will hook the page image directly to the page.
The layout KG generation process will need to be a bit different from the one used for normal PDFs.
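A rough sketch of what this could look like, to make the proposal concrete. Everything here is an assumption for illustration, not existing code in this project: the `PageNode` schema, the `build_scanned_layout_kg` function, and the `fake_ocr` stand-in are hypothetical names. The OCR engine is passed in as a plain callable so that any real OCR model could be swapped in later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PageNode:
    """One page of a scanned PDF in the layout KG (hypothetical schema)."""
    page_number: int
    text: str        # OCR output for the whole page
    image_path: str  # full-page image hooked directly to the page
    # Left empty for now: figures and tables are not extracted from scans.
    children: List[dict] = field(default_factory=list)


def build_scanned_layout_kg(
    page_images: List[str],
    ocr: Callable[[str], str],
) -> Dict[str, object]:
    """Build a page-level layout KG for a scanned PDF.

    `page_images` holds one rendered image path per page, in page order.
    `ocr` is any function that maps an image path to recognised text.
    """
    pages = []
    for number, image_path in enumerate(page_images, start=1):
        pages.append(
            PageNode(
                page_number=number,
                text=ocr(image_path).strip(),
                image_path=image_path,
            )
        )
    return {"node_type": "document", "scanned": True, "pages": pages}


def fake_ocr(image_path: str) -> str:
    """Stand-in for a real OCR engine (assumption, for demonstration only)."""
    return f"text recognised from {image_path}"


kg = build_scanned_layout_kg(["page_1.png", "page_2.png"], fake_ocr)
for page in kg["pages"]:
    print(page.page_number, page.image_path, page.text)
```

The main difference from the normal flow in this sketch is that each page node carries the OCR text plus a single full-page image, and has no figure or table children until detection on scanned pages becomes reliable.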