c-jordi / pdf2data

A PDF segmentation and annotation tool for archival documents.


pdf2data: A PDF segmentation and annotation tool for archival documents.

🚧 IN DEVELOPMENT 🚧

💡 Vision

Develop an open-source, user-friendly tool for technical and non-technical users that performs page, block, and text-line segmentation and combines manual and automatic annotation.

🎥 Preview

[Image: Preview]

🔥 Features

  • Structure your work into projects and case studies.
  • Upload your PDF files.
  • Annotate the results of the segmentation algorithm using the interactive dashboard.
  • Automate the training of a classification algorithm.
  • Export your results for further analysis.
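
As a rough illustration of the page, block, and text-line hierarchy and of the export step, here is a minimal sketch in Python. The class names (`Page`, `Block`, `TextLine`), the `bbox` and `label` fields, and the `export_pages` helper are all hypothetical and are not taken from the pdf2data code base; the actual data model may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TextLine:
    """A single line of text inside a block (hypothetical structure)."""
    bbox: List[float]              # [x0, y0, x1, y1] in page coordinates
    text: str
    label: Optional[str] = None    # filled in manually or by a classifier


@dataclass
class Block:
    """A rectangular region of a page that groups text lines."""
    bbox: List[float]
    lines: List[TextLine] = field(default_factory=list)
    label: Optional[str] = None


@dataclass
class Page:
    """One page of an uploaded PDF and its segmented blocks."""
    number: int
    blocks: List[Block] = field(default_factory=list)


def export_pages(pages: List[Page]) -> str:
    """Serialize the annotated pages to JSON for further analysis."""
    return json.dumps([asdict(page) for page in pages], indent=2)


if __name__ == "__main__":
    line = TextLine(bbox=[0, 0, 100, 10], text="Hello")
    block = Block(bbox=[0, 0, 100, 50], lines=[line], label="header")
    print(export_pages([Page(number=1, blocks=[block])]))
```

Keeping `label` optional at both the block and the line level leaves room for annotations that are entered by hand as well as ones predicted automatically.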

🚀 Quickstart

Development

Start the message broker:

docker-compose up

Start the backend:

source server/venv/bin/activate
make run

Start the worker:

source server/venv/bin/activate
make worker

Start the client:

cd client
yarn start
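
The steps above rely on `docker-compose`, `make`, and `yarn` being available on your PATH. The following is a small optional pre-flight check, written as a sketch; the `find_tools` helper and the `QUICKSTART_TOOLS` tuple are hypothetical names and not part of the repository.

```python
import shutil
from typing import Dict, Iterable, Optional

# Command-line tools invoked by the quickstart steps.
QUICKSTART_TOOLS = ("docker-compose", "make", "yarn")


def find_tools(names: Iterable[str] = QUICKSTART_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```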

🧮 Data Composition

[Image: Data composition]

📚 Stack

[Image: Architecture]

  • Node.js and React.js deliver the interactive dashboard.
  • Tornado runs the data backend.
  • Celery, backed by RabbitMQ, executes asynchronous tasks.
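
To make the division of labour between the backend and the worker concrete, here is a toy sketch of the queue-and-worker pattern. It deliberately uses only Python's standard library (`queue` and `threading`) as a stand-in; it is not Celery or RabbitMQ code, and the function names (`run_worker`, `describe_job`, `demo`) are hypothetical.

```python
import queue
import threading
from typing import Dict


def run_worker(tasks: queue.Queue, results: Dict[str, str]) -> None:
    """Consume tasks until a None sentinel arrives (plays the worker role)."""
    while True:
        item = tasks.get()
        if item is None:           # sentinel: no more work
            tasks.task_done()
            break
        task_id, func, args = item
        results[task_id] = func(*args)
        tasks.task_done()


def describe_job(filename: str, pages: int) -> str:
    """Placeholder for a long-running job such as segmenting a PDF."""
    return f"{filename}: processed {pages} pages"


def demo() -> Dict[str, str]:
    tasks: queue.Queue = queue.Queue()   # plays the message broker role
    results: Dict[str, str] = {}
    worker = threading.Thread(target=run_worker, args=(tasks, results))
    worker.start()
    # The backend enqueues a job and can carry on serving requests.
    tasks.put(("job-1", describe_job, ("a.pdf", 3)))
    tasks.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(demo())
```

In the real stack the queue lives in a separate broker process and the worker runs in its own process, so slow jobs do not block the web backend.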

About


License: MIT License


Languages

  • Python 50.0%
  • JavaScript 37.6%
  • SCSS 9.9%
  • HTML 0.8%
  • TypeScript 0.7%
  • Makefile 0.5%
  • Dockerfile 0.4%
  • CSS 0.2%
  • Shell 0.0%