microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and the GriTS evaluation metric.

Table annotations within full page images

alfassy opened this issue

Hi!
I would like to train a model on your data, but with the Structure annotations of rows, columns, etc. placed within the full-page images found in the Detection data.
Going over your data, I couldn't find any mapping between the annotations in the Detection and Structure datasets. Do you have any such mapping, perhaps in the script that creates the annotations? I would really appreciate the help, and we could publish the result together afterwards.
Thank you,
Amit

Hi,

In theory, the structure annotations overlaid on the full-page table detection images should be recoverable from the PDF-Annotations data.
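
For anyone attempting this, the pairing step might look roughly like the sketch below. It assumes the Detection filename convention of document id plus page number (e.g. PMC1234567_5.jpg), and the PDF-Annotations loader is purely hypothetical; both should be checked against the actual release.

```python
# Minimal sketch: index Detection page images by (doc_id, page_num) so they
# can be paired with per-document entries in the PDF-Annotations data.
# The filename convention and the JSON loader are assumptions, not the
# actual dataset-creation code.
import os
import json

def index_detection_pages(detection_img_dir):
    """Map (doc_id, page_num) -> full-page image filename."""
    pages = {}
    for fname in os.listdir(detection_img_dir):
        stem, ext = os.path.splitext(fname)
        if ext.lower() not in (".jpg", ".png"):
            continue
        doc_id, page_num = stem.rsplit("_", 1)
        pages[(doc_id, int(page_num))] = fname
    return pages

def load_pdf_annotations(annot_path):
    """Hypothetical loader for a per-document PDF-Annotations file."""
    with open(annot_path) as f:
        return json.load(f)
```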

However, one thing to note is that for PubTables-1M, a very small percentage of tables in the current Structure dataset cannot be included in a full-page structure dataset.

The reason for this is that a full-page image is included in the Detection dataset only if every table on that page was recognized during the dataset creation stage. Sometimes the dataset creation script could recognize at least one table on a page but not all of the tables on it. In that case, each successfully recognized table is included in the Structure dataset, but the full page is excluded from the Detection dataset, since it would only be partially annotated.
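
In pseudocode, the inclusion rule amounts to something like the following sketch; the function names are illustrative, not taken from the actual dataset-creation script.

```python
def split_page(page_tables, try_recognize):
    """Return (tables for Structure, whether the page enters Detection)."""
    recognized = [t for t in page_tables if try_recognize(t)]
    # Every successfully recognized table can enter the Structure dataset.
    structure_tables = recognized
    # The full page enters the Detection dataset only if *all* of its
    # tables were recognized; a partially annotated page is excluded.
    include_in_detection = len(recognized) == len(page_tables)
    return structure_tables, include_in_detection
```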

I believe I should be able to write a script to create a full-page Structure dataset using the Detection data and the PDF-Annotations data. I'll give it a try and share it if it's successful.
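
The core of such a script would be translating each table's structure annotations from table-crop coordinates back into full-page coordinates, roughly as sketched below. The crop_padding parameter and the assumption that crops were taken around the table's page bounding box are assumptions about how the crops were produced, not details from the release.

```python
def crop_box_to_page_box(box, table_bbox_on_page, crop_padding=0):
    """Translate an (xmin, ymin, xmax, ymax) box from table-crop
    coordinates into full-page image coordinates."""
    # The crop origin is the table's top-left corner on the page, minus
    # any padding that was added around the table when cropping.
    ox = table_bbox_on_page[0] - crop_padding
    oy = table_bbox_on_page[1] - crop_padding
    xmin, ymin, xmax, ymax = box
    return (xmin + ox, ymin + oy, xmax + ox, ymax + oy)
```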

Best,
Brandon