OCR
This is an end-to-end pipeline for optical character recognition (OCR). A web application delivering the service will follow. It can be viewed here. The models in this pipeline will be trained on the ICDAR dataset and on custom/private invoices.
Objective
Design an OCR system from scratch. It will later be used to design web pages and, potentially, an enhanced Splitwise application.
Prerequisites
- Docker / Microk8s installed [currently not needed, may skip].
- MLFlow: hosted on Docker / Microk8s [currently not needed, may skip].
- GCP account: enable the Cloud Vision API and the Document AI API, and set up appropriate service accounts [needed for /src/pre execution].
Folder Structure
OCR Repo.
|- data
|- raw (this folder holds raw information)
|- test (holds the test set images and their ground truth)
|- train (holds the train and validation split along with their ground truth)
|- logs (any records such as training logs, graphs, etc. go here)
|- mlflow
|- docker (contains scripts to run mlflow in docker)
|- secrets (contains appropriate service accounts, environment variables, etc.)
|- kubernetes (contains scripts to run mlflow in kubernetes - using kustomize)
|- base (foundational scripts to deploy mlflow in kubernetes)
|- local (local system)
|- stage (remote server - changes in DB connections)
|- overlays (you may override base files here)
|- local (local system)
|- stage (remote server - changes in DB connections)
|- secrets (contains secret files needed for the deployment)
|- local (local system)
|- secret (contains service accounts and other raw secrets)
|- stage (remote server - changes in DB connections)
|- runs (pretrained models and saved/logged models go here)
|- models (holds logged models)
|- pretrained-models (holds downloaded models)
|- nbs (contains notebooks that demonstrate the OCR modules: text detection, text recognition and information extraction)
|- src
|- information-extraction (OCR pipeline - information extraction from the detected and recognized texts)
|- pre (scripts that should be run before entering the OCR pipeline)
|- text-detection (OCR pipeline - text detection in scene)
|- text-recognition (OCR pipeline - text recognition in scene)
|- requirements.txt (pip requirements that must be installed prior to running this pipeline)
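To confirm that a checkout matches this layout, a quick check along the following lines may help. It is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it assumes that `data`, `logs`, `mlflow`, `runs`, `nbs`, `src` and `requirements.txt` all sit directly under the repository root, as the listing above suggests.

```python
from pathlib import Path

# Top-level entries taken from the folder listing above
# (assumed to live directly under the repository root).
EXPECTED_DIRS = ("data", "logs", "mlflow", "runs", "nbs", "src")
EXPECTED_FILES = ("requirements.txt",)


def missing_entries(repo_root):
    """Return the expected top-level folders/files that are absent under repo_root."""
    root = Path(repo_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_entries(".")` from the repository root should return an empty list when everything is in place; any names it returns point to folders or files that still need to be created.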
Tasks
- Data acquisition.
- Data Annotation/Preparation.
- Text Detection.
- Text Recognition.
- Information Extraction.
- Text Detection AI Pipeline.
- Text Recognition AI Pipeline.
- Information Extraction AI Pipeline.
- Integration of Blocks.
- Web Application.
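The "Integration of Blocks" task chains the three stages: detection produces text regions, recognition reads each region, and information extraction turns the recognized text into structured fields. The sketch below illustrates one possible way to wire them together. Everything in it is hypothetical: the names `TextRegion` and `run_pipeline`, the `(x_min, y_min, x_max, y_max)` box convention, and the stage signatures are assumptions for illustration, not the repository's actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class TextRegion:
    """A detected box together with the text recognized inside it."""
    box: Box
    text: str = ""


def run_pipeline(
    image: Any,
    detect: Callable[[Any], List[Box]],
    recognize: Callable[[Any, Box], str],
    extract: Callable[[List[TextRegion]], Dict[str, str]],
) -> Dict[str, str]:
    """Run detection, then recognition per box, then information extraction."""
    boxes = detect(image)
    regions = [TextRegion(box=box, text=recognize(image, box)) for box in boxes]
    return extract(regions)
```

Passing the stages in as plain callables keeps each block independently testable: stub functions can stand in for the real models until the corresponding AI pipeline is ready.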
Environment Variables
- GOOGLE_APPLICATION_CREDENTIALS
- HUGGINGFACEHUB_API_TOKEN
- TOKENIZERS_PARALLELISM
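A small pre-flight check can confirm that these variables are present before any part of the pipeline runs. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of the repository, and it only checks that each variable is set and non-empty, not that its value is valid.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "HUGGINGFACEHUB_API_TOKEN",
    "TOKENIZERS_PARALLELISM",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the current process environment; passing a
    plain dict makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means all three variables are set.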