Optical character recognition (OCR) with the help of an R-tree

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image.

Steps:

Convert the PDF to images; if the PDF has multiple pages, save each page as a separate image file.
Convert the colour image to a grayscale image.
Read or create the target bounding boxes.
Use Tesseract to recognize the characters in the image.
Create an R-tree index from the bounding boxes in the Tesseract output data.
Query the R-tree index for the boxes that intersect each target bounding box.
Use the matching indices to look up the corresponding rows in the data frame, and continue processing the text if necessary.
Repeat the above steps for the remaining pages.
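
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the `pdf2image`, `pytesseract` and `rtree` packages, keeps the Tesseract output as a plain list of dicts rather than a data frame, and all function names (`ocr_words`, `build_index`, `text_in_target`) are hypothetical. If `rtree` cannot be imported, a brute-force stand-in with the same two methods is used so the lookup logic can still be tried.

```python
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

try:
    from rtree import index as rtree_index
except (ImportError, OSError):  # rtree or libspatialindex not available
    rtree_index = None


class _LinearIndex:
    """Brute-force stand-in for an R-tree, used only when rtree is missing."""

    def __init__(self) -> None:
        self._boxes: Dict[int, BBox] = {}

    def insert(self, item_id: int, box: BBox) -> None:
        self._boxes[item_id] = box

    def intersection(self, q: BBox) -> List[int]:
        # Two boxes intersect when they overlap on both axes.
        return [i for i, b in self._boxes.items()
                if b[0] <= q[2] and q[0] <= b[2] and b[1] <= q[3] and q[1] <= b[3]]


def ocr_words(pdf_path: str, page: int = 1, dpi: int = 300) -> List[Dict]:
    """Steps 1-4: render one PDF page, convert it to grayscale, run Tesseract."""
    # Imported lazily so the index functions below work without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    image = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)[0]
    gray = image.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip the empty rows emitted for blocks and lines
            left, top = data["left"][i], data["top"][i]
            box = (left, top, left + data["width"][i], top + data["height"][i])
            words.append({"text": text, "bbox": box})
    return words


def build_index(words: List[Dict]):
    """Step 5: one index entry per word box, keyed by its position in `words`."""
    idx = rtree_index.Index() if rtree_index else _LinearIndex()
    for i, word in enumerate(words):
        idx.insert(i, word["bbox"])
    return idx


def text_in_target(words: List[Dict], idx, target: BBox) -> str:
    """Steps 6-7: find the word boxes intersecting `target` and join their text."""
    hits = sorted(idx.intersection(target))  # sort ids to restore reading order
    return " ".join(words[i]["text"] for i in hits)
```

For a real document, `words = ocr_words("scan.pdf", page=1)` would supply the boxes (the file name is a placeholder), and `text_in_target(words, build_index(words), target)` would be called once per target bounding box, then the whole sequence repeated for each page. Sorting the hit ids is a design choice made here because the index does not promise any particular result order.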

References:

ocr wiki
bbox stackoverflow

sudhirln92 / optical-character-recognition
