nayyhah/PDFAutomation-OCRTextRecognition

dell-hack2hire-hackathon image-alignment opencv-python pdf-to-image pytesseract-ocr remove-watermark text-extraction text-extraction-from-image

DECIPHER 2.0 | PDF Automation - OCR Text Recognition

Problem Statement

There are various validation rule engines across the industry. Converting PDF documents into structured data that can be used in other systems is, however, not a trivial task, and many businesses find themselves manually re-keying product data from PDF documents into spreadsheets or databases.

The goal is to extract critical data from those PDFs (invoices, receipts, sales orders) and receive structured data in return. The PDF content also needs to be converted to text that is available for download.

Assumptions/Callouts

Any data set can be used, preferably supply chain related data.
Pick a few sample PDFs with watermarks.

Solution - DECIPHER 2.0

Decipher 2.0 is a user-friendly tool that identifies the data items a user wants to extract from an uploaded invoice and returns the desired fields as a downloadable file in multiple formats. It comes with the following built-in functionalities:

Extracts useful information from large, multiple PDFs in just seconds

Adjusts the alignment of randomly oriented PDF images

Removes watermarks so that they do not interfere with text recognition

Stores the extracted information in cloud storage for future access
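The watermark-removal step listed above can be illustrated with a small sketch. This is not the project's actual implementation, which this README does not describe; it assumes the watermark is lighter than the printed text, so that a simple brightness threshold can push it to white before OCR. The function name `remove_light_watermark` and the threshold of 160 are hypothetical, and the image is modelled as plain lists of grayscale values so the example runs without extra libraries.

```python
def remove_light_watermark(gray, threshold=160):
    """Return a copy of a grayscale image with light pixels forced to white.

    `gray` is a list of rows, each row a list of 0-255 intensity values.
    Assumption: watermark pixels are lighter than `threshold`, while real
    text is darker. The threshold value is illustrative only.
    """
    return [[255 if px >= threshold else px for px in row] for row in gray]


# Tiny example page: dark values are text, 190-200 is a faint watermark.
page = [
    [255, 255, 255, 255],
    [20, 200, 30, 255],
    [255, 190, 255, 25],
]
cleaned = remove_light_watermark(page)
```

On real scans the same idea would normally be applied to an image array, and the threshold would have to be tuned per document, since a dark or coloured watermark would not be separated by this approach.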

Architecture

Technology Stack

Frontend

React JS : web-app user interface
Netlify : deploying the web-app

Backend

Express.js : setting up backend server
Heroku : deploying backend services

Storage

MongoDB : storing schemas
Cloudinary : cloud storage for files

Tools and Languages

Tesseract : extracting text through OCR
Python : parsing PDFs and images
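Once OCR has produced raw text, the desired fields still have to be picked out and written to a downloadable format. The sketch below shows one possible way to do that with regular expressions; it is an illustration, not the project's parser. The field names, the patterns in `FIELD_PATTERNS`, and the sample invoice text are all assumptions made for this example.

```python
import csv
import io
import json
import re

# Hypothetical patterns for three common invoice fields.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(
        r"\bDate\s*[:\-]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return a dict of field name -> first matched value (None if absent)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def to_json(fields):
    """Serialize the extracted fields as a JSON string."""
    return json.dumps(fields, indent=2)


def to_csv(fields):
    """Serialize the extracted fields as a one-row CSV string with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields))
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()


# Made-up OCR output used only to exercise the functions above.
ocr_text = (
    "ACME Supplies\n"
    "Invoice No: INV-1042\n"
    "Date: 12/03/2021\n"
    "Subtotal: $1,100.00\n"
    "Total: $1,250.50\n"
)
fields = extract_fields(ocr_text)
```

Fixed patterns like these only work for invoices with a predictable layout; OCR noise or unusual labels would need more tolerant matching.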

Screenshots

Future Additions

Bounding Box Flexibility : Users can select their own bounding boxes; this feature can be implemented using the Jcrop library

Language Translation : PDFs can be translated from one language to another using libraries like textblob or googletrans

QR Code Reader : Values of QR codes or barcodes in PDFs can also be decoded and then stored using the pyzbar library

Processing Handwritten Invoices : Text from handwritten invoices can be extracted and parsed. For this, the TensorFlow library can be used to create and train a neural network on datasets of extracted invoices.
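As an illustration of the bounding-box idea from the list above, the sketch below crops a user-selected region out of a page image before it would be passed to OCR. It assumes the front end sends the selection as `x`, `y`, `w`, `h` in pixel coordinates; the function name `crop_region` is hypothetical, and the image is modelled as plain lists of pixel rows so the example stays self-contained.

```python
def crop_region(image, x, y, w, h):
    """Crop a rectangle from an image stored as a list of pixel rows.

    (x, y) is the top-left corner of the selection, w and h its size.
    The selection is clamped to the image so an oversized box does not fail.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in image[y0:y1]]


# 4 rows x 5 columns; each value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
selection = crop_region(image, 1, 1, 2, 2)
```

Only the cropped region would then need to be recognized, which keeps unrelated text on the page out of the extracted field.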


Languages

JavaScript : 71.0%
Python : 16.7%
CSS : 11.3%
HTML : 0.9%