Capturing Footers/Metadata in Section Text
jmadeano opened this issue
After some brief experimentation, it seems science-parse does a great job dealing with Unicode issues and parsing chemical formulae for full-text extraction. However, there are multiple cases where the extracted text includes page footers. For example, searching for
"NATURE COMMUNICATIONS | 5:3949 | DOI:"
in the parsed output (txt) yields multiple results stitched into the section content. The source of this text can easily be seen in this pdf. This behavior shows up in most of the PDFs I have tried to parse (seemingly independent of publisher/date), and it is quite difficult to write rules that remove these inclusions after parsing. Is this a known issue/limitation of science-parse, and if so, is there a workaround to intelligently ignore footers for different publishers/paper types?
I can upload more PDFs/parsed examples if that would be helpful. Thanks!
Edit: Reuploaded txt file with improved readability/formatting.
This is a known limitation of science-parse in the sense that we know that body text extraction is kind of crappy. We use it for indexing in a search engine, where it doesn't matter so much whether we have super clean body text. There is some talk about improving it, so the body text becomes more useful in other circumstances, but no work has been scheduled. If we do tackle this problem, it will be as part of https://github.com/allenai/spv2.
In summary, I can't commit to any sort of timeline for fixing it. But I can commit to helping with pull requests that address this problem, either here or, better, at https://github.com/allenai/spv2.
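In the meantime, one possible post-processing workaround is sketched below. It is not part of science-parse or spv2; it is a rough heuristic that assumes a footer shows up as the same run of words several times in the extracted body text. The function name `strip_repeated_runs` and the thresholds `n` and `min_count` are made up for illustration and would need tuning per corpus.

```python
from collections import Counter


def strip_repeated_runs(text, n=6, min_count=3):
    """Drop every run of n consecutive words that occurs at least
    min_count times in text (hypothetical helper, not a science-parse API).

    Whitespace is normalised to single spaces as a side effect.
    """
    words = text.split()
    # All overlapping word n-grams, in order of appearance.
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)

    # Mark every word covered by a frequently repeated n-gram.
    drop = [False] * len(words)
    for i, gram in enumerate(grams):
        if counts[gram] >= min_count:
            for j in range(i, i + n):
                drop[j] = True

    return " ".join(w for w, d in zip(words, drop) if not d)


if __name__ == "__main__":
    footer = "NATURE COMMUNICATIONS | 5:3949 | DOI:"
    # Made-up body text with the footer stitched in three times.
    sample = (
        f"The catalyst was prepared {footer} by heating the precursor. "
        f"Yields improved {footer} after annealing overnight. "
        f"Spectra were recorded {footer} at room temperature."
    )
    print(strip_repeated_runs(sample))
```

Caveats: this only removes the part of the footer that is identical on every page, so page numbers or other varying tokens next to it will survive, and any legitimate phrase of `n` or more words that repeats `min_count` times would be dropped too. It is worth reviewing what gets removed before relying on it.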