foodoh / ocrd_menus

OCR'd text files for all the hotels in Bangalore. The Tesseract OCR engine was used for this purpose.

OCR'd menus

Approach 1: Using the Tesseract OCR engine

We used Tesseract as the OCR engine.

Furthermore, we divided the images into two groups:

  • dark colored
  • light colored

This allows us to tweak the OCR algorithm for each group and helps it perform better.

The processed images are stored in tesseract_menu_data
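The dark/light split described above could be sketched roughly as follows. This is only an illustration, not the code used in this repo: the mean-brightness threshold of 128, the idea of inverting dark images before OCR, and the function names `is_dark`, `invert` and `prepare_for_ocr` are all assumptions. The functions work on plain grayscale values (0-255); with Pillow, `Image.open(path).convert("L").getdata()` would give such values.

```python
from statistics import mean


def is_dark(gray_pixels, threshold=128):
    """Return True if the average brightness is below the threshold.

    gray_pixels: grayscale values in the range 0-255.
    threshold: assumed cut-off between "dark" and "light" images.
    """
    return mean(gray_pixels) < threshold


def invert(gray_pixels):
    """Flip brightness so light text on a dark background becomes dark on light."""
    return [255 - p for p in gray_pixels]


def prepare_for_ocr(gray_pixels, threshold=128):
    """Invert dark images and leave light images unchanged (assumed tweak)."""
    pixels = list(gray_pixels)
    return invert(pixels) if is_dark(pixels, threshold) else pixels


if __name__ == "__main__":
    dark_menu = [10, 20, 30, 240]     # mostly dark background
    light_menu = [250, 240, 230, 15]  # mostly light background
    print(is_dark(dark_menu), is_dark(light_menu))
    print(prepare_for_ocr(dark_menu))
```

Keeping the classification separate from the tweak makes it easy to try a different adjustment per group without touching the split itself.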

Approach 2: free-ocr.com

Used Selenium to automate the interaction with http://free-ocr.com. It has been giving better results than Tesseract.

Note: this approach requires Selenium:

$ pip install selenium

  • Implemented in free_ocr_selenium.py

  • processed_files.sh: shows the ratio of menu images to processed files in the directory (to keep track of progress).

  • Processed images are stored in menu_text (a total of 101 hotel menus were processed, with each hotel having at least 4 menu images).
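The kind of bookkeeping done by processed_files.sh could look something like this in Python. This is a hedged sketch, not a port of the script: the image extensions, the `.txt` output extension, the assumption that an output file shares its name (without extension) with the image, and the function name `processed_ratio` are all made up for illustration.

```python
from pathlib import Path

# Assumed image extensions; adjust to whatever the menu images actually use.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def processed_ratio(image_dir, text_dir):
    """Return (processed, total) for the images in image_dir.

    An image counts as processed when text_dir holds a .txt file with the
    same base name (an assumption about how outputs are named).
    """
    images = {
        p.stem
        for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
    }
    texts = {
        p.stem
        for p in Path(text_dir).iterdir()
        if p.suffix.lower() == ".txt"
    }
    return len(images & texts), len(images)


if __name__ == "__main__":
    import sys

    done, total = processed_ratio(sys.argv[1], sys.argv[2])
    print(f"{done}/{total} menu images processed")
```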

Packages inside

rmgarbage: implements the rules presented in the paper "Automatic Removal of 'Garbage Strings' in OCR Text: An Implementation", which help us decide whether a string is valid or garbage.
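To give a feel for this kind of rule-based filtering, here is a small, simplified sketch. It is not the rmgarbage package and does not reproduce the paper's full rule set: the three rules shown (overlong string, too few alphanumeric characters, a long run of one repeated character), their thresholds, and the names `looks_like_garbage` and `clean_line` are assumptions chosen for illustration.

```python
import re

MAX_LENGTH = 40  # assumed: longer tokens are treated as garbage
# Assumed: four or more identical characters in a row signals garbage.
_REPEAT_RUN = re.compile(r"(.)\1{3,}")


def looks_like_garbage(token: str) -> bool:
    """Return True if the token trips any of the simplified rules."""
    if len(token) > MAX_LENGTH:
        return True
    # Fewer than half of the characters are letters or digits.
    alnum = sum(ch.isalnum() for ch in token)
    if alnum * 2 < len(token):
        return True
    if _REPEAT_RUN.search(token):
        return True
    return False


def clean_line(line: str) -> str:
    """Drop whitespace-separated tokens that look like garbage."""
    return " ".join(t for t in line.split() if not looks_like_garbage(t))


if __name__ == "__main__":
    print(clean_line("Paneer Tikka ~~~~~ 120"))
```

Each rule is checked per token, so a noisy OCR line keeps its readable words and loses only the debris.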

Languages

Python 97.7%, Shell 2.3%