AI4WA / Docs2KG

Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models

Home Page: https://docs2kg.ai4wa.com/

PDF Scanned Files Handling

PascalSun opened this issue

Is your feature request related to a problem? Please describe.
Some of the PDF files are actually scanned documents, which are hard to handle.
We will need to use state-of-the-art models or methods to get them into the KG.

Describe the solution you'd like

  • OCR for text detection works well, so we can start by generating the text.
  • Detecting images and tables in a scanned PDF is still challenging.
    • So we are not going to extract them for now; we will attach the page image directly to the page.

The layout KG generation process will need to be a bit different from the one used for normal PDFs.
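
A rough sketch of what the scanned-PDF path could look like, to make the proposal concrete. Everything here is an assumption for illustration: `PageNode`, `build_scanned_layout`, and the dictionary layout are hypothetical names, not the existing Docs2KG schema, and the OCR step is passed in as a plain callable so that any OCR engine could be plugged in later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PageNode:
    """Layout-KG node for one scanned page (hypothetical schema)."""
    page_number: int
    text: str        # OCR output for the whole page
    image_path: str  # the page image, attached as-is
    # Figures and tables are not extracted for scanned pages yet,
    # so the page has no child nodes for now.
    children: List[dict] = field(default_factory=list)


def build_scanned_layout(
    page_images: List[str],
    ocr: Callable[[str], str],
) -> Dict[str, object]:
    """Build a minimal layout structure for a scanned PDF.

    page_images: one rendered image path per page, in page order.
    ocr: any function mapping an image path to its recognised text
         (assumed to be supplied by the caller).
    """
    pages = [
        PageNode(page_number=i, text=ocr(path).strip(), image_path=path)
        for i, path in enumerate(page_images, start=1)
    ]
    return {"node_type": "document", "scanned": True, "pages": pages}


def fake_ocr(path: str) -> str:
    """Stand-in OCR used only to demonstrate the flow."""
    return f" text of {path} \n"


layout = build_scanned_layout(["page_1.png", "page_2.png"], fake_ocr)
```

In a real run, `fake_ocr` would be replaced by a wrapper around an actual OCR engine. The `scanned` flag is one possible way to let later KG stages know that a page carries only text plus a whole-page image.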