impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.

Home Page: https://impresso.github.io/impresso-text-acquisition/

Impresso Text Importer

The Impresso TextImporter is a library and a collection of scripts for importing newspaper data from a variety of formats (e.g. Olive XML and various flavors of METS/ALTO XML) into Impresso’s canonical JSON format.

Please refer to the documentation at https://impresso.github.io/impresso-text-acquisition/ for further information on this library.

Installation

With pip:

pip install impresso-text-importer
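To confirm that the install succeeded, one option is to query the package metadata from Python's standard library. The sketch below is not part of the library: `installed_version` is a hypothetical helper, and the only name taken from this README is the distribution name `impresso-text-importer` used in the pip command.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only uses the standard library.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name as used in the pip command above.
    name = "impresso-text-importer"
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed in this environment")
    else:
        print(f"{name} {version} is installed")
```

This checks the distribution name rather than an import name, so it makes no assumption about how the package's modules are laid out.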

License

The second project 'impresso - Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio' is funded by the Swiss National Science Foundation (SNSF) under grant number CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Aiming to develop and consolidate tools to process and explore large-scale collections of historical newspapers and radio archives, and to study the impact of this tooling on historical research practices, Impresso II builds upon the first project – 'impresso - Media Monitoring of the Past' (grant number CRSII5_173719, Sinergia program). More information at https://impresso-project.ch.

Copyright (C) 2024 The impresso team (contributors to this program: Matteo Romanello, Maud Ehrmann, Alex Flückiger, Edoardo Tarek Hölzl, Pauline Conti).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details.

About

License: GNU Affero General Public License v3.0


Languages

Jupyter Notebook 67.0%, Python 33.0%