foodoh/ocrd_menus

OCR'd menus

We used Tesseract as the OCR engine.

Furthermore, we divided the images into two groups, dark colored and light colored, which allows us to tune the OCR algorithm for each group and helps it perform better.
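
The dark/light split described above could be approximated by comparing an image's average brightness against a cutoff. The following is only a minimal sketch of that idea, not the code used in this repository: the function name `classify_brightness` and the threshold of 128 are assumptions chosen for illustration.

```python
from statistics import mean


def classify_brightness(gray_pixels, threshold=128):
    """Label an image 'dark' or 'light' from its grayscale values (0-255).

    Hypothetical helper: the threshold is an assumed cutoff, not a value
    taken from this repository.
    """
    pixels = list(gray_pixels)
    if not pixels:
        raise ValueError("image has no pixels")
    # A mean below the cutoff means the image is mostly dark.
    return "dark" if mean(pixels) < threshold else "light"


if __name__ == "__main__":
    print(classify_brightness([12, 30, 25, 40]))
    print(classify_brightness([210, 240, 225, 250]))
```

The pixel values can come from any image library, for example after converting an image to grayscale mode "L" with Pillow.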

The processed images are stored in tesseract_menu_data.
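
As a rough illustration of the Tesseract step, the sketch below builds and runs the standard `tesseract <image> <output base> -l <lang>` command line from Python. The helper names `tesseract_command` and `run_ocr`, the use of tesseract_menu_data as the output directory, and the `eng` language are assumptions made for this example; the repository's own scripts may work differently.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_dir="tesseract_menu_data", lang="eng"):
    """Build the tesseract CLI call for one menu image (hypothetical helper)."""
    image = Path(image_path)
    # Tesseract appends ".txt" to the output base on its own.
    out_base = Path(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_ocr(image_path, out_dir="tesseract_menu_data", lang="eng"):
    """Run tesseract on one image and return the path of the text file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(tesseract_command(image_path, out_dir, lang), check=True)
    return Path(out_dir) / (Path(image_path).stem + ".txt")
```

Calling `run_ocr` requires the `tesseract` binary to be installed and on the PATH.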

We also used Selenium to automate the interaction with http://free-ocr.com. It has been giving better results than Tesseract.

Note:

This requires Selenium: $ pip install selenium

processed_files.sh: shows the ratio of processed files to menu images in the directory (to keep track of progress).
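
The same bookkeeping can be sketched in Python for readers who prefer it to the shell script. This is an illustrative equivalent, not a translation of processed_files.sh: the function name `processed_ratio`, the directory arguments, and the list of image extensions are all assumptions.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the actual menu images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def processed_ratio(image_dir, text_dir):
    """Return (processed, total_images, ratio) for the two directories."""
    images = [
        p for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    texts = list(Path(text_dir).rglob("*.txt"))
    # Avoid dividing by zero when no images are present.
    ratio = len(texts) / len(images) if images else 0.0
    return len(texts), len(images), ratio
```

For example, `processed_ratio("menu_images", "menu_text")` would compare a hypothetical image directory against the text output directory.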

Processed images are stored in menu_text. A total of 101 hotel menus were processed, with each hotel having at least 4 menu images.

rmgarbage: implements the rules presented in the paper Automatic Removal of “Garbage Strings” in OCR Text: An Implementation, which help us decide whether a string is valid or garbage.
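
To give a feel for this kind of rule-based filtering, here is a simplified sketch loosely following three heuristics of the sort the paper describes: overly long strings, strings with more non-alphanumeric than alphanumeric characters, and strings with a character repeated four or more times in a row. The function name `is_garbage` and the exact thresholds are assumptions for illustration; this is not the repository's rmgarbage code and it does not cover the full rule set.

```python
import re


def is_garbage(token, max_len=40):
    """Return True if an OCR token looks like garbage (simplified sketch)."""
    # Length rule: very long strings are unlikely to be real words.
    if len(token) > max_len:
        return True
    # Alphanumeric rule: more symbols than letters and digits.
    alnum = sum(ch.isalnum() for ch in token)
    if alnum < len(token) - alnum:
        return True
    # Repetition rule: the same character four or more times in a row.
    if re.search(r"(.)\1{3,}", token):
        return True
    return False


if __name__ == "__main__":
    for word in ["Biryani", "Rs.250", "#$%a", "llllll"]:
        print(word, is_garbage(word))
```

Tokens flagged by such a filter can be dropped from the OCR output before further processing.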

OCR'd text files for all the hotels in Bangalore. The Tesseract OCR engine was used for this purpose.
