huwan / pdftitle

PDF article title extraction tool

pdftitle

The command-line tool pdftitle is a Python implementation of the approach described in the SciPlore Xtract paper [1], relying mostly on structural layout analysis.

Docear has since published the open-source tool PDF Inspector, which does roughly the same as this script. PDF Inspector differs from pdftitle as follows:

  • Written in Java
  • Uses PDFBox jPod instead of pdftohtml
  • Simpler heuristics

[1] Joeran Beel, Bela Gipp, Ammar Shaker, and Nick Friedrich. SciPlore Xtract: Extracting Titles from Scientific PDF documents by Analyzing Style Information (Font Size). In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL-10), volume 6273 of Lecture Notes in Computer Science (LNCS), pages 413-416, Glasgow (UK), September 2010. Springer.

Background

The title of a PDF article is sometimes part of the filename, but often it is not. The next place to look is the title field of the PDF metadata (e.g. using pdfinfo), but this is also often unset or set incorrectly. Converting the PDF to text and picking the first line often gives false positives or incomplete titles.
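
To illustrate the metadata check, here is a minimal sketch that reads the Title field from the text output of pdfinfo (part of Poppler). It is not part of pdftitle and is written for Python 3.7+, unlike pdftitle itself. The function names and the sample output are made up for illustration. The sample shows a typical case where the field is set but does not contain the article title.

```python
import subprocess


def parse_pdfinfo_title(pdfinfo_output):
    """Return the Title field from pdfinfo's text output, or None if unset."""
    for line in pdfinfo_output.splitlines():
        # pdfinfo prints one "Key:   value" pair per line; split on the
        # first colon only, since the title itself may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def metadata_title(pdf_path):
    """Run pdfinfo on a file and return its metadata title, if any."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfinfo_title(result.stdout)


# Hypothetical pdfinfo output: the title was filled in by the word
# processor and is useless as an article title.
sample = (
    "Title:          Microsoft Word - draft_v3_final.doc\n"
    "Author:         jdoe\n"
    "Pages:          12\n"
)
print(parse_pdfinfo_title(sample))
```

A result like this one has to be treated with suspicion, which is why the metadata alone is not a reliable source.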

Usage

$ pdftitle --help
usage: pdftitle [-h] [-r] [-m] [-s] [-t TOP_MARGIN] [-n MIN_LENGTH]
                [-x MAX_LENGTH] [-d] [-v]
                FILE [FILE ...]

Tries to identify the title of PDF format paper.

positional arguments:
  FILE                  Path to PDF file(s)

optional arguments:
  -h, --help            show this help message and exit
  -r, --rename          Rename file with found title
  -m, --multiline       Concatenate multiple title lines considered (default)
  -s, --singleline      Only use first title line considered
  -t TOP_MARGIN, --top-margin TOP_MARGIN
                        Top margin start to search for title (default: 70)
  -n MIN_LENGTH, --min-length MIN_LENGTH
                        Min. considerable title length (default: 15)
  -x MAX_LENGTH, --max-length MAX_LENGTH
                        Max. considerable title length (default: 250)
  -d, --debug           Print error stacktrace for unknown errors
  -v, --version         show program's version number and exit
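
The options above hint at how a font-size heuristic can be parameterized. The following is a simplified sketch of such a heuristic, not pdftitle's actual implementation. It assumes the XML produced by `pdftohtml -xml` (with `fontspec` and `text` elements), assumes that the top margin is compared against the `top` attribute of each text line, and reuses the defaults from the help text. The function name `guess_title` and the sample document are hypothetical. The code targets Python 3, unlike pdftitle itself.

```python
import xml.etree.ElementTree as ET


def guess_title(xml_text, top_margin=70, min_length=15, max_length=250,
                multiline=True):
    """Guess a title from `pdftohtml -xml` output using font size alone."""
    root = ET.fromstring(xml_text)
    page = root.find("page")  # only the first page is considered
    if page is None:
        return None

    # Map font id -> font size, taken from the <fontspec> declarations.
    sizes = {f.get("id"): int(f.get("size")) for f in root.iter("fontspec")}

    # Collect (font size, top, text) for each text line below the top margin.
    lines = []
    for node in page.iter("text"):
        text = " ".join("".join(node.itertext()).split())
        top = int(node.get("top"))
        if text and top >= top_margin:
            lines.append((sizes.get(node.get("font"), 0), top, text))
    lines.sort(key=lambda item: item[1])  # reading order, top to bottom

    # Try font sizes from largest to smallest until a candidate fits.
    for size in sorted({item[0] for item in lines}, reverse=True):
        parts = [text for s, _, text in lines if s == size]
        candidate = " ".join(parts) if multiline else parts[0]
        if min_length <= len(candidate) <= max_length:
            return candidate
    return None


# Hypothetical first page: a running header, a two-line title, an author line.
sample = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="9" family="Times" color="#000000"/>
<fontspec id="1" size="22" family="Times" color="#000000"/>
<fontspec id="2" size="12" family="Times" color="#000000"/>
<text top="40" left="100" width="300" height="12" font="0">Journal of Examples 12 (2010)</text>
<text top="120" left="100" width="600" height="26" font="1">A Study of <i>Title</i> Extraction</text>
<text top="150" left="100" width="600" height="26" font="1">from Scientific Documents</text>
<text top="200" left="100" width="400" height="15" font="2">A. Author and B. Author</text>
</page>
</pdf2xml>"""

print(guess_title(sample))                   # both title lines joined
print(guess_title(sample, multiline=False))  # first title line only
```

This sketch joins every line set in the chosen font size, even if the lines are not adjacent, so a real implementation would need further checks.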

Dependencies

  • Python >=2.5 (but < 3.x)

  • Poppler >=0.20.5 (contains pdftohtml)

    $ brew install poppler

  • lxml (optional, for higher accuracy)

    $ pip install lxml

Accuracy

Version 1.0: On a sample set of 261 PDFs from the field of biology (which has many scanned PDFs), the success rate was 60.08%.

Version 1.1: On the same kind of sample set of 261 biology PDFs (with many scanned PDFs), the success rate was 76.25%.

Version 1.2: No comparison available. (I lost the original sample set)

Version 1.3: No comparison available. (I lost the original sample set)

License

pdftitle is licensed under a BSD License.
