TransparencyToolkit/OCRServer

This is the software for the Transparency Toolkit OCR server. It receives data from the document upload form, OCRs the documents, and saves the results.

Install the following packages:

graphicsmagick
poppler-data
poppler-utils
ghostscript
tesseract-ocr
pdftk
libreoffice
openjdk-8-jdk
openjdk-8-jre
libcurl3
libcurl3-gnutls
libcurl4-openssl-dev
libmagic-dev
unoconv

Install the following gems:

mimemagic
docsplit
curb
ruby filemagic
pry
mail
listen
rubyzip

Install Apache Tika server by downloading the .jar from https://tika.apache.org/download.html
Start Tika by running: java -jar tika-server-1.18.jar
Optionally: Install ABBYY. This is not free software, but has higher quality OCR for some file types. Images and image-style PDFs as well as office documents that fail OCR with Tika will default to using ABBYY if it is installed. A license for the command line version can be purchased at https://www.ocr4linux.com/en:pricing:start. The OCR server will default to Tesseract if ABBYY is not installed.
Setup and start https://github.com/TransparencyToolkit/DocUpload Documents need to be uploaded for the rest to work.
Set the following environment variables:

OCR_IN_PATH: The path for documents and metadata to input
OCR_OUT_PATH: The path for documents to output
PROJECT_INDEX: The index name in elastic.

In this directory (for the OCRServer), run: ruby run_ocr.rb It will then listen for new documents in OCR_IN. This must be started BEFORE documents are uploaded.

Note: To run on an existing directory, set inotify_works = false in input_output/load_files.rb

TransparencyToolkit / OCRServer

About

Languages