anushkaspatil / text-extractor

Extracting the text by processing the images and PDFs

Text Extractor

Goal

This project extracts text from images and PDF files using only OpenCV and Pytesseract. The code lives in the Jupyter notebook text_extraction.ipynb.

User Story

As a user, I should be able to easily process images and extract text from them using this tool.

Approach

Try 1:

Text extraction uses Pytesseract, the Python wrapper for Tesseract-OCR. Images are passed directly to Pytesseract's image_to_string function. PDFs are first converted into page images with the pdf2image library, then preprocessed with OpenCV before Tesseract extracts the text.
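
The Try 1 flow can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function names (is_pdf, extract_text_from_image, extract_text_from_pdf, extract_text), the 300 dpi setting, and the grayscale plus Otsu thresholding step are all assumptions, since the README does not say which OpenCV preprocessing is applied. The OCR libraries are imported inside the functions so the helpers can be loaded even where Tesseract is not installed.

```python
from pathlib import Path


def is_pdf(path):
    """Return True when the file extension indicates a PDF."""
    return Path(path).suffix.lower() == ".pdf"


def extract_text_from_image(image_path):
    """Run Tesseract directly on an image file."""
    # Imported lazily so this module loads without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def extract_text_from_pdf(pdf_path, dpi=300):
    """Convert each PDF page to an image, preprocess it, and OCR it."""
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    pages_text = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        # pdf2image yields PIL images; convert to an RGB NumPy array for OpenCV.
        rgb = np.array(page.convert("RGB"))
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        # Assumed preprocessing: Otsu binarization to separate text from background.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        pages_text.append(pytesseract.image_to_string(binary))
    return "\n".join(pages_text)


def extract_text(path):
    """Dispatch to the PDF or image path based on the file extension."""
    if is_pdf(path):
        return extract_text_from_pdf(path)
    return extract_text_from_image(path)
```

Calling extract_text("scan.pdf") or extract_text("photo.png") (hypothetical file names) would return the recognized text as a single string, provided Tesseract and, for PDFs, Poppler are installed on the system.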

Try 1 ran into an issue with the second image. Try 2 addresses this limitation.

Try 2:

Try 2 focuses solely on text extraction but employs a different approach:

  • The image is split into three color channels (red, green, blue), and the image is visualized using Matplotlib to provide the user with a preview of the uploaded image or specified image path.
  • A configuration string is passed to Tesseract OCR through the config argument of pytesseract's image_to_string function. It sets -l eng (English language), --oem 3 (the default OCR Engine Mode, which picks whichever engine is available), and --psm 6 (Page Segmentation Mode that assumes a single uniform block of text).
  • The extracted text is printed.
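
The three steps above might look roughly like the sketch below. The helper names build_tesseract_config and preview_and_extract are hypothetical, and re-merging the split channels into RGB order for display is an assumption about how the notebook uses them; only the configuration values (-l eng, --oem 3, --psm 6) come from this README.

```python
def build_tesseract_config(lang="eng", oem=3, psm=6):
    """Build the configuration string passed to pytesseract.image_to_string."""
    return f"-l {lang} --oem {oem} --psm {psm}"


def preview_and_extract(image_path):
    """Show a preview of the image, then OCR it and print the text."""
    # Imported lazily so the config helper works without the OCR stack installed.
    import cv2
    import matplotlib.pyplot as plt
    import pytesseract

    bgr = cv2.imread(image_path)
    if bgr is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # OpenCV loads images in BGR order; split the channels and re-merge
    # them as RGB so Matplotlib displays the colors correctly.
    blue, green, red = cv2.split(bgr)
    rgb = cv2.merge([red, green, blue])

    plt.imshow(rgb)
    plt.axis("off")
    plt.show()

    text = pytesseract.image_to_string(rgb, config=build_tesseract_config())
    print(text)
    return text
```

If a page contains scattered text rather than one block, a different segmentation mode can be tried by changing the psm argument, for example build_tesseract_config(psm=11) for sparse text.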

Resources

  1. Tesseract Documentation
  2. Medium Blog: How to Use Tesseract Library for OCR in Google Colab Notebook
  3. Blog: OCR with Tesseract

License

This project is licensed under the MIT License.
