tagtog/java-ocr-amazon-textract-searchable-pdf

ocr ocr-recognition text-annotation text-annotation-tool tagtog amazon-textract nlp nlu nlp-machine-learning ai

This is a fully functioning sample repository showing how to:

use an external OCR provider (in this case, Amazon Textract).
upload the resulting PDFs into tagtog.

The code is written in Java (11).

This code starts from an Amazon Textract Tutorial (original code) to OCR input files (PDFs or images) and convert them into "searchable PDFs" (i.e. PDFs with embedded text). These "searchable PDFs" are exactly what we want to upload to tagtog to then annotate them using tagtog Native PDF.

This repository adds utilities (e.g. recursively traversing and processing the given directories) and uses the tagtog Documents API to upload the results to a given tagtog project. HTTP requests are made in Java with Apache HttpClient (4.5).
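The recursive traversal mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the class name `InputCollector`, its method names, and the list of accepted extensions are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; class and method names are not taken from the repository.
final class InputCollector {

    // Assumed set of extensions to OCR; adjust to what your OCR provider accepts.
    private static final Set<String> EXTENSIONS = Set.of("pdf", "png", "jpg", "jpeg");

    // Returns true when the file name ends with one of the accepted extensions.
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        return dot >= 0 && EXTENSIONS.contains(name.substring(dot + 1));
    }

    // Expands a file or directory into the sorted list of supported files at or below it.
    static List<Path> collect(Path fileOrDirectory) throws IOException {
        try (Stream<Path> paths = Files.walk(fileOrDirectory)) {
            return paths.filter(Files::isRegularFile)
                        .filter(InputCollector::isSupported)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

Because `Files.walk` also accepts a single regular file, the same call handles both kinds of command-line input: `InputCollector.collect(Path.of(arg))` for each argument.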

The main entry point is DemoTagtogOcr.java. The main ingredients of the code are 3:

🧱 Compile

git clone https://github.com/tagtog/java-ocr-amazon-textract-searchable-pdf.git
cd java-ocr-amazon-textract-searchable-pdf/src/SearchablePDF/

./compile.sh

⚡️ Run

# Set your tagtog credentials
export TAGTOG_USERNAME=???
export TAGTOG_PASSWORD=???
# export TAGTOG_DOMAIN=??? # optionally, override the tagtog domain, for example if you are running tagtog OnPremises

time ./run.sh MY_TAGTOG_OWNERNAME MY_TAGTOG_PROJECT MY_TAGTOG_FOLDER ...inputFilesOrDirectories
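As a rough illustration of how the arguments and environment variables above might map onto an upload request, here is a minimal sketch. It is not the repository's code: the class `TagtogTarget`, the default domain, the `/-api/documents/v1` path, and the `owner`, `project`, `folder`, and `output` parameter names are assumptions based on the public tagtog Documents API, so check them against the API documentation for your tagtog version.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;

// Hypothetical helper; names and the default domain are assumptions, not the repository's code.
final class TagtogTarget {

    // Assumed default, used only when TAGTOG_DOMAIN is not set.
    static final String DEFAULT_DOMAIN = "https://www.tagtog.net";

    // Picks the domain from an environment map, falling back to the assumed default.
    static String domain(Map<String, String> env) {
        String d = env.get("TAGTOG_DOMAIN");
        return (d == null || d.isBlank()) ? DEFAULT_DOMAIN : d;
    }

    // Builds the upload URI; the path and parameter names are assumed, see the lead-in.
    static URI uploadUri(String domain, String owner, String project, String folder) {
        String query = "owner=" + enc(owner) + "&project=" + enc(project)
                     + "&folder=" + enc(folder) + "&output=null";
        return URI.create(domain + "/-api/documents/v1?" + query);
    }

    // Value for the HTTP "Authorization" header using Basic authentication.
    static String basicAuth(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    // URL-encodes a single query parameter value.
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

In a real run, the environment map would be `System.getenv()`, and the credentials passed to `basicAuth` would come from `TAGTOG_USERNAME` and `TAGTOG_PASSWORD`.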

🤓 Setup Amazon Textract

If you are new to AWS or unsure about the details, follow the complete AWS guide to getting started with Amazon Textract.

In short, what you need is:

Make sure you have an IAM user with AmazonTextractFullAccess permissions and an access key.
Configure your local AWS credentials, with the [default] profile pointing to that IAM user, and set your desired region.
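A quick preflight check for the second step could look like the sketch below. It assumes the standard shared credentials file (`~/.aws/credentials`, INI format, with `aws_access_key_id` and `aws_secret_access_key` keys); the class `AwsProfileCheck` is hypothetical and not part of this repository. It only verifies that the keys are present, not that they are valid.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical preflight check; not part of the repository.
final class AwsProfileCheck {

    // Returns true when the given INI lines contain a [default] section
    // that defines both aws_access_key_id and aws_secret_access_key.
    static boolean hasDefaultCredentials(List<String> lines) {
        Set<String> keys = new HashSet<>();
        boolean inDefault = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                inDefault = line.equals("[default]"); // track which section we are in
                continue;
            }
            int eq = line.indexOf('=');
            if (inDefault && eq > 0) {
                keys.add(line.substring(0, eq).trim()); // record the key name only
            }
        }
        return keys.contains("aws_access_key_id") && keys.contains("aws_secret_access_key");
    }
}
```

To check your own setup, pass it `Files.readAllLines(Path.of(System.getProperty("user.home"), ".aws", "credentials"))`. The region normally lives in the separate `~/.aws/config` file, which this sketch does not inspect.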

🍃 Sample tagtog Project

Using this very same code, we OCR'ed the FUNSD dataset and uploaded the results into the tagtog public project: tagtog/FUNSD-OCRed 😃.

We ran exactly the following (last updated on 2021-04-20):

time ./run.sh tagtog FUNSD-OCRed testing_data ~/Downloads/dataset/testing_data/  # took ~2m; 50 docs in total
time ./run.sh tagtog FUNSD-OCRed training_data ~/Downloads/dataset/training_data/  # took ~6m; 149 docs in total

These are some sample annotated documents in tagtog.

Notes

The original demo code tends to create oversized PDFs and to write the embedded character offsets slightly below their actual (visual) positions. These details can be tweaked and, of course, depend on the OCR software used.
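One possible way to tweak the vertical placement is sketched below. It relies on two facts: Amazon Textract reports bounding boxes as ratios of the page size with the origin at the top-left, while PDF coordinates start at the bottom-left by default. Everything else is an assumption for illustration: the class `TextPlacement`, and in particular the `descentRatio` parameter, which is a guess at how one might lift the text baseline and is not how the original demo code works.

```java
// Hypothetical helper; not the repository's code.
final class TextPlacement {

    // Converts a normalized top offset and height (0..1, origin at the top-left of the page)
    // into a PDF baseline y coordinate in points (origin at the bottom-left).
    // descentRatio is an assumed fraction of the box height that lies below the baseline.
    static double baselineY(double pageHeight, double top, double height, double descentRatio) {
        double boxBottom = pageHeight - (top + height) * pageHeight; // flip the y axis
        return boxBottom + descentRatio * height * pageHeight;       // lift by the assumed descent
    }

    // Converts a normalized left offset into a PDF x coordinate in points.
    static double x(double pageWidth, double left) {
        return left * pageWidth;
    }
}
```

With `descentRatio` set to 0 the text is drawn at the bottom edge of the box; a small positive value moves it up, which may compensate for text that appears too low.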

About

Sample Java code to OCR input files (with Amazon Textract) and upload the resulting PDFs to tagtog 🤘🚀.

https://www.tagtog.net/tagtog/FUNSD-OCRed/-settings

