NikhilKamathB / OCR

OCR end-to-end pipeline

OCR

This is an end-to-end pipeline for the task of optical character recognition (OCR). Following this, a web application delivering the service will be built; it can be viewed here. The models in this pipeline will be trained on the ICDAR dataset and on custom/private invoices.

Objective

Design an OCR system from scratch. It will later be used to power web pages and, potentially, an enhanced Splitwise-style application.

Prerequisites

  • Docker / Microk8s installed [currently not needed, may skip].
  • MLFlow: hosted on Docker / Microk8s [currently not needed, may skip].
  • GCP account: enable the Cloud Vision API and Document AI API, and set up appropriate service accounts [needed for /src/pre execution].
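
For reference, here is a minimal sketch of how the /src/pre step might call the Cloud Vision API for text detection during data preparation. It assumes GOOGLE_APPLICATION_CREDENTIALS points to a valid service-account key and that the google-cloud-vision package is installed; the function name and image path are placeholders, not the actual /src/pre API.

```python
# Minimal sketch (not the actual /src/pre code): text detection via the
# Cloud Vision API. Requires GOOGLE_APPLICATION_CREDENTIALS to be set and
# the google-cloud-vision package to be installed.
from google.cloud import vision

def detect_text(image_path: str) -> list[str]:
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    # The first annotation is the full detected text; the rest are word-level boxes.
    return [a.description for a in response.text_annotations]

if __name__ == "__main__":
    print(detect_text("data/raw/train/sample_invoice.jpg"))  # placeholder path
```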

Folder Structure

OCR Repo.

|- data
    |- raw (this folder holds raw information)
        |- test (holds the test set images and their ground truth)
        |- train (holds the train and validation split along with their ground truth)
|- logs (any records such as training logs, graphs, etc. go here)
|- mlflow
    |- docker (contains scripts to run mlflow in docker)
        |- secrets (contains appropriate service accounts, environment variables, etc.)
    |- kubernetes (contains scripts to run mlflow in kubernetes - using kustomize)
        |- base (foundational scripts to deploy mlflow in kubernetes)
            |- local (local system)
            |- stage (remote server - changes in DB connections)
        |- overlays (you may override base files here)
            |- local (local system)
            |- stage (remote server - changes in DB connections)
        |- secrets (contains secret files needed for the deployment)
            |- local (local system)
            |- secret (contains service accounts and other raw secrets)
            |- stage (remote server - changes in DB connections)
|- runs (pretrained models and saved/logged models go here)
    |- models (holds logged models)
    |- pretrained-models (holds downloaded models)
|- nbs (contains notebooks that demonstrate the modules of OCR - text detection, text recognition and information extraction)
|- src
    |- information-extraction (OCR pipeline - information extraction from the detected and recognized texts)
    |- pre (the scripts in this folder ought to be run before stepping into the OCR pipeline)
    |- text-detection (OCR pipeline - text detection in scene)
    |- text-recognition (OCR pipeline - text recognition in scene)
|- requirements.txt (pip requirements that must be installed prior to running this pipeline)

Tasks

  • Data acquisition.
  • Data Annotation/Preparation.
    • Text Detection.
    • Text Recognition.
    • Information Extraction.
  • Text Detection AI Pipeline.
  • Text Recognition AI Pipeline.
  • Information Extraction AI Pipeline.
  • Integration of Blocks.
  • Web Application.
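
Once the three AI pipelines are in place, integrating the blocks amounts to chaining detection, recognition and extraction. A rough sketch of that flow is below; the function names are illustrative placeholders, not the actual interfaces under /src.

```python
# Illustrative integration sketch: detect -> recognize -> extract.
# detect/recognize/extract stand in for the text-detection, text-recognition
# and information-extraction modules under src/; they are not real APIs here.
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height of a text region

def detect(image) -> List[Box]:
    """Text detection: locate text regions in the scene image."""
    raise NotImplementedError

def recognize(image, boxes: List[Box]) -> List[str]:
    """Text recognition: transcribe each detected region."""
    raise NotImplementedError

def extract(words: List[str]) -> Dict[str, str]:
    """Information extraction: map recognized text to structured fields."""
    raise NotImplementedError

def run_ocr(image) -> Dict[str, str]:
    boxes = detect(image)
    words = recognize(image, boxes)
    return extract(words)
```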

Environment Variables

  • GOOGLE_APPLICATION_CREDENTIALS
  • HUGGINGFACEHUB_API_TOKEN
  • TOKENIZERS_PARALLELISM
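
These can be exported in the shell or set programmatically before running the notebooks or the scripts under /src; the values below are placeholders.

```python
# Placeholder values: replace with your own credentials and paths.
import os

os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/path/to/service-account.json")
os.environ.setdefault("HUGGINGFACEHUB_API_TOKEN", "hf_xxx")   # Hugging Face Hub access token
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")      # silence tokenizer fork warnings
```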

