techytushar / ocr-date-extractor

API to extract dates from documents using OCR

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

OCR Date Extractor

Flask API to extract dates from documents

How to use

The API is provided with 2 routes:

  • If you want to pass Base64 encoded image, send a POST request with payload {"base_64_image_content": <base_64_image_bytes>} to
https://ocr-date-extractor.herokuapp.com/extract_date
  • If you want to pass image file, send a POST request with payload {'image': <image_file>} to
https://ocr-date-extractor.herokuapp.com/extract_date_from_image

Python sample code to test out the API:

  1. Sending the image as Base64 encoded
import requests, base64
img_url = <path_to_image>
with open(img_url, 'rb') as f:
    img = base64.b64encode(f.read())
response = requests.post('https://ocr-date-extractor.herokuapp.com/extract_date', data={'base_64_image_content':img})
print(response.content)
  1. Directly uploading the file
import requests
url = "https://ocr-date-extractor.herokuapp.com/extract_date_from_image"
files=[
  ('image',('document.png',open('/Users/tushar/peak/document.png','rb'),'image/png'))
]
response = requests.post(url, data=payload, files=files)
print(response.text)

Working

The project performs the following steps for any given image:

  • Re-scales the image if its too big in size
  • Performs thresholding to separate foreground (the document) and the background
  • Find contours and draws a bounding box on the document present in the image
  • Crops the image to keep only the document
  • Performs thresholding again to separate text from the background
  • Apply OCR to extract text
  • Use regex to extract out the date
  • Date is then parsed and returned in YYYY-MM-DD format

Supported Date Formats

Following date format are supported with some flexibility:

  • dd-mm-yyyy
  • mm-dd-yyyy
  • yyyy-mm-dd
  • dd/mm/yyyy
  • mm/dd/yyyy
  • yyyy/mm/dd
  • Aug23'19
  • Feb 24, 2019
  • 24 May'19

References

I took help from the following resources:

About

API to extract dates from documents using OCR


Languages

Language:Python 100.0%