afcarl / book-segmentation

Labeled segmentation for the document structure of printed books

book-segmentation

Data, code and trained models to segment the document structure of printed books and label each segment according to ten categories:

  • Title page (including half titles)
  • Ad card (advertisements)
  • Publisher information
  • Dedication
  • Preface
  • Table of contents
  • Text
  • Appendix
  • Index
  • N/A

The data, categorization system, and models are described in more detail in:

Lara McConnaughey, Jennifer Dai and David Bamman (2017), "The Labeled Segmentation of Printed Books" (EMNLP 2017)

This model makes use of data from Ted Underwood's DataMunging repo.

Usage

To segment a book from the HathiTrust named book.zip using the default model:

    python code/segment_book.py book.zip models/labseg10/

This should output a list of page numbers and labels for all pages in book.zip.
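Per-page labels are often easier to read once consecutive pages with the same label are merged into segments. The following is a minimal sketch of that post-processing step, not part of this repository. It assumes the script's output has been saved as one "page<TAB>label" line per page; the exact output format is not documented here, so the parsing step and the label strings used in the example are assumptions, and `parse_lines` and `group_segments` are hypothetical helper names.

```python
from itertools import groupby


def parse_lines(lines):
    """Parse "page<TAB>label" lines into (page, label) pairs.

    The tab-separated layout is an assumption; adjust the split to match
    whatever segment_book.py actually prints.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page, label = line.split("\t", 1)
        pairs.append((int(page), label))
    return pairs


def group_segments(page_labels):
    """Merge runs of consecutive identical labels into (first, last, label)."""
    segments = []
    for label, run in groupby(page_labels, key=lambda pair: pair[1]):
        pages = [page for page, _ in run]
        segments.append((pages[0], pages[-1], label))
    return segments


if __name__ == "__main__":
    # Hypothetical output for a five-page book.
    example = [
        "1\tTitle page",
        "2\tPublisher information",
        "3\tText",
        "4\tText",
        "5\tIndex",
    ]
    for first, last, label in group_segments(parse_lines(example)):
        print(first, last, label)
```

Grouping only merges adjacent pages, so a label that reappears later in the book (for example, an ad card at both the front and the back) yields two separate segments.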

Dependencies

NumPy (pip install numpy --user), SciPy (pip install scipy --user), and TensorFlow 1.0.
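Before running the segmenter, it can help to confirm that the three packages are importable. The sketch below is not part of this repository; it assumes a Python 3 interpreter and only checks that each package can be found, not that the installed TensorFlow is version 1.0. `missing_dependencies` is a hypothetical helper name.

```python
import importlib.util

# Import names of the packages listed above.
REQUIRED = ("numpy", "scipy", "tensorflow")


def missing_dependencies(names=REQUIRED):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```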
