allenai / science-parse

Science Parse parses scientific papers (in PDF form) and returns them in structured form.

Capturing Footers/Metadata in Section Text

jmadeano opened this issue

After some brief experimentation, it seems science-parse does a great job of handling Unicode issues and parsing chemical formulae during full-text extraction. However, in multiple cases the extracted text includes page footers. For example, searching for

"NATURE COMMUNICATIONS | 5:3949 | DOI:"

in the parsed output (txt) yields multiple hits embedded in the section content. The source of this text can easily be seen in this PDF. This behavior shows up in most of the PDFs I tried to parse (seemingly independent of publisher/date), but it is quite difficult to write rules that remove these inclusions after parsing. Is this a known issue/limitation of science-parse, and if so, is there a workaround to intelligently ignore footers for different publishers/paper types?
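To make the post-parsing difficulty concrete, here is a minimal sketch of the kind of rule in question. It is not part of science-parse: `FOOTER_RE`, `find_repeated_footers`, and `strip_footers` are hypothetical names, the pattern is tuned only to the pipe-delimited footer quoted above, and the DOI in the usage example is a made-up placeholder. The idea is to treat a footer-shaped string as a footer only if it occurs more than once, then cut every occurrence out of the text.

```python
import re
from collections import Counter

# Hypothetical pattern for a running footer shaped like
# "NATURE COMMUNICATIONS | 5:3949 | DOI: <doi>". It only covers this layout;
# other publishers would need their own patterns.
FOOTER_RE = re.compile(r"[A-Z][A-Z ]+ \| \d+:\d+ \| DOI:\s*\S+")


def find_repeated_footers(text, min_count=2):
    """Return footer-shaped strings that occur at least min_count times."""
    counts = Counter(m.group(0) for m in FOOTER_RE.finditer(text))
    return [s for s, n in counts.items() if n >= min_count]


def strip_footers(text, footers):
    """Remove each footer string and collapse the whitespace left behind."""
    for footer in footers:
        text = text.replace(footer, " ")
    return re.sub(r"[ \t]{2,}", " ", text)
```

A usage example on a made-up fragment:

```python
footer = "NATURE COMMUNICATIONS | 5:3949 | DOI: 10.0000/example"  # placeholder DOI
parsed = f"The sample was heated {footer} to 300 K. Results {footer} were stable."
print(strip_footers(parsed, find_repeated_footers(parsed)))
```

This breaks down quickly: a footer that directly follows an all-caps word (an acronym, say) gets matched together with that word, and every publisher layout needs its own pattern, which is why a general solution would be preferable.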

I can upload more PDFs/parsed examples if that would be helpful. Thanks!

Edit: Reuploaded txt file with improved readability/formatting.

This is a known limitation of science-parse in the sense that we know that body text extraction is kind of crappy. We use it for indexing in a search engine, where it doesn't matter so much whether we have super clean body text. There is some talk about improving it, so the body text becomes more useful in other circumstances, but no work has been scheduled. If we do tackle this problem, it will be as part of https://github.com/allenai/spv2.

In summary, I can't commit to any sort of timeline for fixing it. But I can commit to helping with pull requests that address this problem, either here or, better, at https://github.com/allenai/spv2.