paperswithcode / sota-extractor

The SOTA extractor pipeline

Create a table extractor

rstojnic opened this issue

Create a function that takes a LaTeX file as input and extracts all tables in a consistent format.

Perhaps the right output format is a list of rows, since this is how tables are specified in LaTeX.
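
To make the proposed output format concrete, here is a minimal sketch of what "a list of rows" could look like for one table body. The function name `tabular_body_to_rows`, the type aliases, and the example values are all made up for illustration and are not part of the repo. It assumes rows end with `\\` and cells are separated by `&`, and it ignores escaped `\&`, `\multicolumn`, and nested tables.

```python
from typing import List

Row = List[str]      # one table row: a list of cell strings
Table = List[Row]    # one table: a list of rows


def tabular_body_to_rows(body: str) -> Table:
    """Split the body of a tabular environment into rows of cell strings."""
    rows: Table = []
    # In LaTeX, "\\" ends a row and "&" separates cells.
    for raw_row in body.split("\\\\"):
        # \hline is layout rather than data, so drop it.
        cleaned = raw_row.replace("\\hline", "").strip()
        if not cleaned:
            continue
        rows.append([cell.strip() for cell in cleaned.split("&")])
    return rows


# Illustrative input only; the values are invented.
example_body = r"""
\hline
Model & Accuracy \\
\hline
Baseline & 71.2 \\
Ours & 74.5 \\
\hline
"""

example_rows = tabular_body_to_rows(example_body)
# example_rows == [["Model", "Accuracy"], ["Baseline", "71.2"], ["Ours", "74.5"]]
```

Keeping every cell as a raw string leaves header detection and number parsing to a later step.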

I have evaluated the different table extraction strategies. The options are:

  1. Extracting from the PDF - this works extremely poorly and is not worth following up.
  2. Extracting from LaTeX using latexml - according to Wikipedia, this seems to work in about 90% of cases (60% with no errors), but in my experience it is closer to 50%. When it doesn't work, we essentially get nothing out. The pro of this approach is that, when it works, it gives us structured information about the LaTeX document. The con is that it doesn't work as often as we need it to.
  3. Extracting from LaTeX using our own parser - table syntax is not that complex, so it should be possible to extract tables with high precision using a relatively simple parser. The pro is that we can get a very high table extraction rate without much work. The con is that we don't get any other information in structured form, so it would be difficult to extract additional structure from the document.
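
As a rough illustration of option 3, a simple regex-based extractor might look like the sketch below. This is not the parser in this repo; `extract_tables` and the regex names are hypothetical. It assumes plain, non-nested `tabular` environments and does not expand `\multicolumn` or `\multirow`, nor handle `tabular*` or `tabularx`.

```python
import re
from typing import List

# A non-nested tabular environment; the group captures its body.
# The column spec ({lcr}, {|l|p{3cm}|}, ...) may contain one level of braces.
TABULAR_RE = re.compile(
    r"\\begin\{tabular\}\s*(?:\[[^\]]*\])?\s*"
    r"\{(?:[^{}]|\{[^{}]*\})*\}(.*?)\\end\{tabular\}",
    re.DOTALL,
)
COMMENT_RE = re.compile(r"(?<!\\)%.*")  # unescaped % up to end of line
RULE_RE = re.compile(r"\\(?:hline|toprule|midrule|bottomrule)\b|\\cline\{[^}]*\}")
ROW_END_RE = re.compile(r"\\\\(?:\[[^\]]*\])?")  # "\\" or "\\[2pt]"
CELL_SEP_RE = re.compile(r"(?<!\\)&")  # "&" but not the escaped "\&"


def extract_tables(latex_source: str) -> List[List[List[str]]]:
    """Return every tabular in the source as a list of rows of cell strings."""
    source = COMMENT_RE.sub("", latex_source)
    tables = []
    for match in TABULAR_RE.finditer(source):
        # Horizontal rules carry no data, so remove them before splitting.
        body = RULE_RE.sub("", match.group(1))
        rows = []
        for raw_row in ROW_END_RE.split(body):
            if not raw_row.strip():
                continue
            rows.append([cell.strip() for cell in CELL_SEP_RE.split(raw_row)])
        tables.append(rows)
    return tables


# Illustrative document; the contents are invented.
sample = r"""
\documentclass{article}
\begin{document}
\begin{table}
\begin{tabular}{l|c}
\hline
Method & Score \\ % header row
\hline
A \& B & 1 \\
C & 2 \\
\hline
\end{tabular}
\end{table}
\end{document}
"""

tables = extract_tables(sample)
```

Here `tables` holds a single table with three rows, and the escaped `\&` stays inside its cell instead of splitting it. A document with no `tabular` environments yields an empty list, so a failure to match is easy to detect.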

Final conclusion: going with a homegrown parser seems to be the best solution, as it would let us extract the largest number of tables. Extracting tasks and semantically interpreting those tables would have to be done in a separate step down the line. That step could use the latexml output if available, or fall back to the simple detex output.