The framework processes documents and text by extracting the document content, annotating and translating the text, and validating the output.
The framework is developed in TypeScript and can be used directly in Node.js.
The service is based on the qtopology module, a distributed stream processing layer, which makes it possible to construct components and add them to the tool.
- Create a `.env` file in the `env` folder. See instructions described in this readme.
- node.js v6.0 and npm 5.3 or higher
To test that your node.js version is correct, run `node --version` and `npm --version`.
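The version check can also be scripted. The following is a minimal sketch, not part of the project: `parseVersion` and `meetsMinimum` are hypothetical helper names, and the minimums are the ones listed in the prerequisites (node.js v6.0, npm 5.3).

```typescript
// Parse a version string such as "v6.11.0" or "5.3" into numeric parts.
function parseVersion(version: string): number[] {
    return version
        .trim()
        .replace(/^v/, "")
        .split(".")
        .map((part) => parseInt(part, 10) || 0);
}

// Return true when `actual` is greater than or equal to `minimum`.
// Missing parts are treated as 0, so "6" compares equal to "6.0.0".
function meetsMinimum(actual: string, minimum: string): boolean {
    const a = parseVersion(actual);
    const m = parseVersion(minimum);
    const length = Math.max(a.length, m.length);
    for (let i = 0; i < length; i++) {
        const left = a[i] || 0;
        const right = m[i] || 0;
        if (left !== right) {
            return left > right;
        }
    }
    return true;
}

// Minimums taken from the prerequisites in this readme.
const MIN_NODE = "6.0";
const MIN_NPM = "5.3";

// Literal version strings are used here for illustration; in practice,
// pass the output of `node --version` and `npm --version`.
console.log(meetsMinimum("v6.11.0", MIN_NODE)); // true
console.log(meetsMinimum("5.0.3", MIN_NPM));    // false
```

Comparing the parts numerically avoids the mistake of comparing version strings as text, where "6.11" would sort before "6.2".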
To install the project, run `npm install`.
The pipeline uses a Node.js module called textract, which extracts text from most text-based file formats. For some file types, additional libraries need to be installed:
- PDF extraction requires `pdftotext` be installed.
- DOC extraction requires `antiword` be installed, unless on OSX, in which case textutil (installed by default) is used.
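The requirements above can be summarized as a small lookup. This is an illustrative sketch only: `requiredTool` is a hypothetical helper, not a textract function, and it covers just the two file types named in the list. The value `"darwin"` is what Node.js reports in `process.platform` on OSX.

```typescript
// Return the external tool needed to extract text from the given file,
// or null when the list above names no extra tool for that file type.
function requiredTool(filePath: string, platform: string): string | null {
    const dot = filePath.lastIndexOf(".");
    const extension = dot >= 0 ? filePath.slice(dot + 1).toLowerCase() : "";
    switch (extension) {
        case "pdf":
            return "pdftotext";
        case "doc":
            // On OSX the preinstalled textutil is used instead of antiword.
            return platform === "darwin" ? "textutil" : "antiword";
        default:
            return null;
    }
}

// Platform strings are passed explicitly here; in practice,
// `process.platform` can be supplied.
console.log(requiredTool("slides/lecture.pdf", "linux"));  // pdftotext
console.log(requiredTool("notes.DOC", "darwin"));          // textutil
console.log(requiredTool("notes.doc", "linux"));           // antiword
```

A `null` result only means that this readme lists no additional tool for the file type. It does not guarantee that textract can handle the file without further dependencies.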
To build the project and use the developed components, run `npm run build`.
- Bolts. The bolts used to process the documents and text.
- Spouts. The spouts used to retrieve the metadata and send it to the bolts.
- Ontologies. The ontology definitions and examples.
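To give a rough idea of what a bolt does, here is a self-contained sketch of a processing step that receives a message, transforms its text, and emits the result downstream. It assumes a callback-style `init`/`receive` shape loosely modeled on stream processing bolts. The class `WhitespaceBolt`, the `Message` shape, and the `onEmit` option are hypothetical and are not taken from this project or from the qtopology API.

```typescript
type Callback = (error?: Error) => void;
type Emit = (data: Message, streamId: string | null, callback: Callback) => void;

// Hypothetical message shape: a document with a text attribute.
interface Message {
    text?: string;
    [key: string]: any;
}

// Hypothetical bolt that collapses runs of whitespace in the text.
class WhitespaceBolt {
    private name = "";
    // Until init is called, emitting just invokes the callback.
    private emit: Emit = (_data, _streamId, callback) => callback();

    // Store the bolt name and the function used to pass data downstream.
    init(name: string, config: { onEmit: Emit }, callback: Callback): void {
        this.name = name;
        this.emit = config.onEmit;
        callback();
    }

    // Process one message and forward the result to the next component.
    receive(message: Message, streamId: string | null, callback: Callback): void {
        if (typeof message.text !== "string") {
            // Report the problem instead of emitting an invalid message.
            callback(new Error(`[${this.name}] message has no text`));
            return;
        }
        const text = message.text.replace(/\s+/g, " ").trim();
        this.emit({ ...message, text }, streamId, callback);
    }
}

// Usage: collect emitted messages in an array instead of a real topology.
const emitted: Message[] = [];
const bolt = new WhitespaceBolt();
bolt.init(
    "whitespace-bolt",
    { onEmit: (data, _streamId, done) => { emitted.push(data); done(); } },
    () => {}
);
bolt.receive({ text: "  Open   educational\nresources " }, null, () => {});
console.log(emitted.map((message) => message.text));
```

In this sketch a spout would play the opposite role: it has no `receive` method and instead produces the messages that bolts such as this one consume.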
This work is developed by AILab at Jozef Stefan Institute.
The work is supported by X5GON, a project that connects OER repositories and provides services to improve the educational process.