PDF parsing toolkit for preparing text corpus

Introduction

This repo contains a PDF parsing toolkit that converts PDF to Markdown for preparing text corpora. sciparser 1.0 builds on PDF Parser ToolKits, which gathers the most-used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. In recent works such as K2 and GeoGalactica, we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is publicly available.

Try DEMO

In this repo and demo, we share only the secondary processing solution built on Grobid. In the near future, we will share a solution that combines multiple PDF parsing backends.

Requirements

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt

Usage

python

First, clone the whole repo.

git clone https://github.com/davendw49/sciparser.git

Then import the pipeline file to do the parsing.

from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')
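To parse several PDFs in one run, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of sciparser: `parse_directory` is a hypothetical helper, and it assumes that `pipeline` accepts the path to a single PDF file and returns the parsed result. The parser is passed in as an argument, so the helper makes no assumption about what `pipeline` returns.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def parse_directory(
    pdf_dir: str, parse: Callable[[str], Any]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Apply `parse` to every *.pdf file directly inside `pdf_dir`.

    `parse` is any callable taking a file path; passing sciparser's
    `pipeline` here assumes it accepts a single PDF path.
    Returns (results, failures), both keyed by file name.
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    # Sorted for a stable order; the pattern is case-sensitive on most systems.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf_path.name] = parse(str(pdf_path))
        except Exception as exc:
            # Record the error and continue with the remaining files.
            failures[pdf_path.name] = f"{type(exc).__name__}: {exc}"
    return results, failures


# Hypothetical usage from inside the cloned repo:
# from pipeline import pipeline
# results, failures = parse_directory('/path/to/your/pdfs', pipeline)
```

Collecting failures separately keeps one unreadable PDF from stopping the whole batch, which is useful when preparing a large corpus.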

gradio

python main.py

Citation

@misc{sciparser,
  author = {Cheng Deng},
  title = {Sciparser: PDF parsing toolkit for preparing text corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davendw49/sciparser}},
}

Reference

PDF Parser ToolKits: https://github.com/Acemap/pdf_parser
TEI-XML Parser (grobid_tei_xml): https://gitlab.com/internetarchive/grobid_tei_xml

About

PDF parsing toolkit for preparing academic text corpus

https://sciparser.acemap.info

Apache License 2.0