GilbertoBotaro / scalable-ocr

Scalable Optical Character Recognition with Apache NiFi and Tesseract

Scalable OCR

Welcome to the project

Much of our data exists as human-readable scans of documents. Reading these documents one at a time does not scale, and the need to ingest large numbers of PDFs or scanned documents shows up in almost every sector. Inevitably, these scans must be converted to text before they can be analyzed. Since handling unstructured data is one of the main selling points of a platform like Hadoop, we need to convert large volumes of potentially large documents into a textual representation. This project shows how to use scalable open source tooling (Apache NiFi and Tesseract) to convert large volumes of PDFs to text and ingest them into a platform where the data can be analyzed at scale.

Modules

Core Modules

  • conversion - convert multi-page PDFs to single-page TIFF files
  • preprocessing - image correction for better text extraction during OCR
  • extraction - OCR images and output text
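
The three core modules form a pipeline: conversion feeds preprocessing, which feeds extraction. The sketch below only illustrates that ordering. Every name in it (`OcrPipelineSketch`, `run`, `splitPages`, `demo`) is hypothetical and not taken from the project's code, and the stages are plain-text stand-ins so the example runs without any PDF, imaging, or Tesseract dependency.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: these names are illustrative, not the project's API.
final class OcrPipelineSketch {

    /** Chains the three stages in order: conversion, preprocessing, extraction. */
    static List<String> run(
            byte[] document,
            Function<byte[], List<byte[]>> conversion, // multi-page document -> single pages
            Function<byte[], byte[]> preprocessing,    // page -> corrected page
            Function<byte[], String> extraction) {     // page -> text
        List<String> pageTexts = new ArrayList<>();
        for (byte[] page : conversion.apply(document)) {
            pageTexts.add(extraction.apply(preprocessing.apply(page)));
        }
        return pageTexts;
    }

    /** Stand-in for conversion: treats form feeds as page breaks. */
    static List<byte[]> splitPages(byte[] document) {
        List<byte[]> pages = new ArrayList<>();
        for (String page : new String(document, StandardCharsets.UTF_8).split("\f")) {
            pages.add(page.getBytes(StandardCharsets.UTF_8));
        }
        return pages;
    }

    /** Runs the pipeline with stand-in stages on a fake plain-text "document". */
    static List<String> demo(String fakeDocument) {
        return run(
                fakeDocument.getBytes(StandardCharsets.UTF_8),
                OcrPipelineSketch::splitPages,
                // Stand-in for image correction: strips surrounding whitespace.
                page -> new String(page, StandardCharsets.UTF_8).trim()
                        .getBytes(StandardCharsets.UTF_8),
                // Stand-in for OCR: decodes the bytes as text.
                page -> new String(page, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(demo("page one \f page two"));
    }
}
```

In a real deployment each stand-in would be replaced by the corresponding module, and each stage could run as its own NiFi processor so that pages are processed independently.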

Utility

  • CLI - command line tool for manual pipeline process execution
  • NiFi - custom processors that expose the core modules via NiFi, plus a workflow template

Developers

Cutting a release for ocr

mvn release:prepare -Dscm-connection.url=<scm readonly url> -Dscm-developer-connection.url=<scm read-write url>

Note: The main pom assumes the "scm:git:" prefix, so pass in only the URL portion as a build parameter, as shown above.

Examples (see [maven scm](http://maven.apache.org/scm/git.html)):

  1. local git - file://localhost/foo/bar/mygitrepodir
  2. github connection url (readonly) - git://github.com/mmiklavc/myproject.git
  3. github developer connection url (read/write) - git@github.com:mmiklavc/myproject.git

Performing the release prepare will do the following high-level steps:

  1. Change pom versions from X.X-SNAPSHOT to X.X
  2. Commit the new poms for the release to Git
  3. Tag the release commit in Git
  4. Increment poms to a new SNAPSHOT version, e.g. update from X.0-SNAPSHOT to X.1-SNAPSHOT
  5. Commit the updated SNAPSHOT poms
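
The version changes in steps 1 and 4 can be sketched as two small string transformations. This is only an illustration of the arithmetic described above, assuming a purely numeric, dot-separated version. The class and method names are hypothetical, and the release plugin itself may propose or prompt for different versions.

```java
// Hypothetical sketch of the version arithmetic in steps 1 and 4.
// Not the Maven release plugin's implementation.
final class ReleaseVersionSketch {

    private static final String SNAPSHOT = "-SNAPSHOT";

    /** Step 1: X.X-SNAPSHOT becomes X.X. */
    static String releaseVersion(String snapshotVersion) {
        if (!snapshotVersion.endsWith(SNAPSHOT)) {
            throw new IllegalArgumentException("Not a SNAPSHOT version: " + snapshotVersion);
        }
        return snapshotVersion.substring(0, snapshotVersion.length() - SNAPSHOT.length());
    }

    /** Step 4: increments the last numeric component and re-appends -SNAPSHOT. */
    static String nextSnapshotVersion(String releaseVersion) {
        int lastDot = releaseVersion.lastIndexOf('.');
        String prefix = releaseVersion.substring(0, lastDot + 1);
        int lastComponent = Integer.parseInt(releaseVersion.substring(lastDot + 1));
        return prefix + (lastComponent + 1) + SNAPSHOT;
    }

    public static void main(String[] args) {
        String release = releaseVersion("1.0-SNAPSHOT");
        System.out.println(release);                      // 1.0
        System.out.println(nextSnapshotVersion(release)); // 1.1-SNAPSHOT
    }
}
```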

See the [Maven release prepare](http://maven.apache.org/maven-release/maven-release-plugin/examples/prepare-release.html) documentation for more detail.

License: Apache License 2.0


Languages

Java 92.7%, Scala 5.8%, Python 1.5%