kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Home Page: https://grobid.readthedocs.io

How to get section hierarchy from fulltext?

ValeKnappich opened this issue

Hi,

I am using grobid to extract the pdf full text (/processFulltextDocument).
It works great except that all sections are put on the same level and there doesn't seem to be a way to extract the hierarchical relationships.

Is there any way to do that with grobid?

Thanks

Hi @ValeKnappich !

Yes, currently the sections are "flat". In the past, grobid actually created a hierarchy of sections, but it did not work well and could lead to well-formedness problems in the resulting XML (although not frequently). Since it was not reliable, it was removed until something better is available to support this feature.

See issue #377

Hi @kermitt2!

Thanks for the reply. As issue #377 is from 2019, I assume the problem is not going to be resolved any time soon.

Do you know any good workarounds for this? I tried using pdfplumber to get all the text, find the headings and compare font sizes. But it's somewhat tedious and far from perfect.

Yes @ValeKnappich, it is not a priority, as compared to the new figure/table recognition approach for example.

In my previous approach, I was clustering the section headers based on font size, style and font name to try to identify header "levels" (all this font information in Grobid comes from pdfalto, which is similar to pdfplumber but written in C++ and 20-50 times faster; scaling is one of the top requirements in Grobid). But as for you, it was far from perfect.
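As a rough illustration of the font-based idea, here is a minimal sketch. It is not Grobid's implementation: it replaces real clustering with a simple ranking of distinct (font size, bold) styles, and the input tuples and the function name `levels_from_fonts` are hypothetical (in practice the font data would come from a tool such as pdfalto or pdfplumber).

```python
def levels_from_fonts(headers):
    """Assign a level to each header from its font style.

    headers: list of (text, font_size, is_bold) tuples (hypothetical input).
    Returns a list of (text, level); level 1 is the most prominent style.
    """
    # Round sizes so that tiny extraction differences do not create new styles.
    styled = [(text, (round(size, 1), bool(bold))) for text, size, bold in headers]
    # Distinct styles, largest font first; bold ranks above regular at equal size.
    styles = sorted({style for _, style in styled}, reverse=True)
    rank = {style: i + 1 for i, style in enumerate(styles)}
    return [(text, rank[style]) for text, style in styled]
```

This breaks down in exactly the cases discussed in this thread, for example when a template uses the same size and weight for two different levels.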

We could also use the PDF outline information (the kind of table of contents embedded in the PDF). When present, it gives a reliable section hierarchy, with the coordinates of the section headers, in maybe 50% of the cases. The problem is that in the other 50% of the cases it is crap and noise and would lead to errors, so for the moment I have also disabled the usage of PDF outline information.

When present, the numbering also gives information about the hierarchy.
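For instance, the depth of a decimal section number maps directly to a level. The sketch below is only an assumption about how one might do this; `heading_level` is a hypothetical helper, and it handles only Arabic decimal numbering (no Roman numerals or letters).

```python
import re

# Leading numbering such as "2", "2.1" or "2.1.3", with an optional trailing dot.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?:\s|$)")


def heading_level(heading):
    """Return the level implied by the numbering, or None if unnumbered."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None
    # "2" has no dot (level 1), "2.1" has one dot (level 2), and so on.
    return match.group(1).count(".") + 1
```

Headers that return None still need another signal, such as font information or the PDF outline.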

Maybe an interesting approach could be to combine all these features in a dedicated classifier, which would predict the hierarchical levels of a list of headers.

Thanks @kermitt2! If I decide to spend more time on this and find a decently working solution, I will post an update here.

@ValeKnappich I came across this issue randomly, but maybe my new Python package grobidmonkey might help? To get the outline hierarchy, you can simply do:

from grobidmonkey import reader
monkeyReader = reader.MonkeyReader('monkey') # or 'lxml' or 'x2d'

# read paper outline
outline = monkeyReader.readOutline('/path/to/your/paper.pdf.tei.xml')

outline is an anytree.RenderTree object. To print it, you can use:

for pre, fill, node in outline:
    print("%s%s" % (pre, node.name))

and the output will look like:

Article
├── 1 Introduction
├── 2 Proposed Method
│   ├── 2.1 ...
│   ├── 2.2 ...
│   └── 2.3 ...
├── 3 Experiments and Results
│   ├── 3.1 ...
│   ├── 3.2 ...
│   └── 3.3 ...
└── 4 Conclusion

My approach is based on the section 'index' in the TEI XML output. If you have any feedback, please let me know. Hope this helps!

@com3dian thanks for reaching out.

Indeed, your approach works as long as the headings are numbered (I guess that's where the <head n= comes from).

However, that's not always the case for me.

You might want to account for that in your implementation in some way. At the moment, it will throw an error if the <head> tag does not have the attribute n.
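One way to tolerate a missing n attribute is sketched below, using only the standard library. This is not grobidmonkey's code. It assumes the Grobid TEI layout `<body>`/`<div>`/`<head n="...">` in the TEI namespace, and it treats unnumbered headings as top-level, which is a heuristic; `read_heads` and `default_level` are hypothetical names.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_heads(tei_xml, default_level=1):
    """Return (level, title) pairs for the section heads in the TEI body."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    heads = []
    # Only heads that are direct children of a <div>, so figure heads are skipped.
    for head in body.iterfind(".//tei:div/tei:head", TEI_NS):
        # A missing attribute yields "", not an exception.
        number = (head.get("n") or "").strip().rstrip(".")
        level = number.count(".") + 1 if number else default_level
        title = "".join(head.itertext()).strip()
        heads.append((level, title))
    return heads
```

A more careful fallback could combine this with font features instead of a fixed default level.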

@ValeKnappich Hi thanks for the feedback.

You are right, this package was developed from wrapper code I used in my own project, so there are likely some issues. Can you share some papers/TEI XMLs whose <head> elements do not include the attribute n, so that I can see whether I can improve the package?

PS: I have also tried the font size and font type solution, and I feel that including both features in your classifier might help. I have seen some journal templates where subsection titles have almost the same font size as the body text, but in that case they usually use another font type.