eriknovak / DTProc

A framework for extracting document metadata, content, annotations, etc. Implemented using @qminer/qtopology.

DTProc: Document and Text Processing Framework

The framework processes documents and text by extracting the document content, annotating and translating the text, and validating the output.

The framework is developed in TypeScript and, once built, can be used from Node.js.

The service is built on the qtopology module, a distributed stream processing layer. With qtopology, processing components can be constructed and added to the framework.
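As a rough illustration of what such a component can look like, here is a minimal sketch of a bolt-shaped class. It is not taken from DTProc: the class name TrimTextBolt and the textField option are hypothetical, and the method names (init, heartbeat, shutdown, receive) and the onEmit config field are assumptions about the qtopology bolt interface that should be checked against the qtopology documentation.

```typescript
// Sketch of a qtopology-style bolt. The method names and the `onEmit`
// config field are ASSUMPTIONS about the qtopology interface; TrimTextBolt
// and `textField` are hypothetical names used only for illustration.

type Callback = (error?: Error) => void;
type EmitFn = (data: unknown, streamId: string | null, callback: Callback) => void;

interface BoltConfig {
    onEmit: EmitFn;      // function used to pass a message downstream
    textField?: string;  // hypothetical: attribute that holds the raw text
}

class TrimTextBolt {
    private name = "";
    private textField = "text";
    private onEmit: EmitFn = (_data, _streamId, callback) => callback();

    // Stores the bolt name and configuration, then signals readiness.
    init(name: string, config: BoltConfig, _context: unknown, callback: Callback): void {
        this.name = name;
        this.onEmit = config.onEmit;
        this.textField = config.textField ?? "text";
        callback();
    }

    // Nothing to do periodically in this sketch.
    heartbeat(): void { }

    // No resources to release.
    shutdown(callback: Callback): void {
        callback();
    }

    // Collapses runs of whitespace in the text attribute and emits the message.
    receive(data: Record<string, unknown>, streamId: string | null, callback: Callback): void {
        const raw = data[this.textField];
        if (typeof raw !== "string") {
            callback(new Error(`[${this.name}] missing text field "${this.textField}"`));
            return;
        }
        const cleaned = raw.replace(/\s+/g, " ").trim();
        this.onEmit({ ...data, [this.textField]: cleaned }, streamId, callback);
    }
}
```

In this sketch the bolt never mutates the incoming message. It emits a copy with the cleaned text, so other attributes travel through the pipeline unchanged.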

Prerequisites

  • Create a .env file in the env folder. See the instructions described in this readme.

  • Node.js v6.0 or higher and npm 5.3 or higher

    To check that your Node.js and npm versions are sufficient, run node --version and npm --version.

Install

To install the project run

npm install

Textract Dependencies

The pipeline uses a Node.js module called textract, which extracts text from most text-based file formats. Some file types require additional libraries to be installed:

  • PDF extraction requires pdftotext to be installed.
  • DOC extraction requires antiword to be installed, except on OSX, where textutil (installed by default) is used.
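The requirements above can be checked before running the pipeline. The following is a minimal sketch, not part of DTProc: the function names requiredTool and isInstalled and the sample file names are hypothetical, and it assumes that which (or where on Windows) is available to look up commands on the PATH.

```typescript
import { spawnSync } from "child_process";
import { extname } from "path";

// Returns the external tool listed above for a given file, or null when
// no extra tool is listed for that file type. (Hypothetical helper.)
function requiredTool(filePath: string, platform: string = process.platform): string | null {
    const ext = extname(filePath).toLowerCase();
    if (ext === ".pdf") return "pdftotext";
    // On OSX ("darwin") the preinstalled textutil is used instead of antiword.
    if (ext === ".doc") return platform === "darwin" ? "textutil" : "antiword";
    return null;
}

// Reports whether a command can be found on the PATH, using `which`
// (or `where` on Windows). Returns false if the lookup itself fails.
function isInstalled(command: string, platform: string = process.platform): boolean {
    const lookup = platform === "win32" ? "where" : "which";
    return spawnSync(lookup, [command], { stdio: "ignore" }).status === 0;
}

// Example run over a few hypothetical file names.
for (const file of ["slides.pdf", "notes.doc", "readme.txt"]) {
    const tool = requiredTool(file);
    if (tool === null) {
        console.log(`${file}: no extra tool listed`);
    } else {
        console.log(`${file}: needs ${tool} (${isInstalled(tool) ? "found" : "missing"})`);
    }
}
```

Running a check like this at startup surfaces a missing tool early, instead of as an extraction error in the middle of a run.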

Build

To build the project and use the developed components run

npm run build

Table of Contents

  • Bolts. The bolts used to process the documents and text.
  • Spouts. The spouts used to retrieve the metadata and send it to the bolts.
  • Ontologies. The ontology definitions and examples.

Acknowledgments

This work is developed by AILab at the Jozef Stefan Institute.

The work is supported by X5GON, a project that connects OER repositories and provides services to improve the educational process.

License: BSD 2-Clause "Simplified" License


Languages

JavaScript 33.9%, HTML 32.6%, TypeScript 27.5%, Rich Text Format 5.9%, Julia 0.1%, CSS 0.0%