This Python script extracts text from images using Tesseract OCR and organizes it into an Excel file.
- Automated Installation: Checks for required Python modules (`pytesseract`, `openpyxl`, `pandas`) and installs them if missing.
- Text Extraction: Utilizes Tesseract OCR to extract text from images.
- Data Parsing: Parses the extracted text for contact names and times seen, and writes them to an Excel file.
- Logging: Logs informative messages, warnings, and errors for better tracking and debugging.
- User Interaction: Prompts the user for image and output folder paths, allowing for interactive usage.
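
The automated installation step listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `find_missing` and `install_missing` are hypothetical, and it assumes `pip` is available for the interpreter running the script.

```python
import importlib.util
import subprocess
import sys

# Module names taken from the feature list above.
REQUIRED_MODULES = ["pytesseract", "openpyxl", "pandas"]


def find_missing(modules):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def install_missing(modules):
    """Install each missing module with pip, using the current interpreter.

    Assumption: pip is installed and the environment allows installs.
    """
    for name in find_missing(modules):
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])


# Typical use at the top of a script (commented out so that importing
# this sketch never triggers an install by accident):
# install_missing(REQUIRED_MODULES)
```

Calling pip through `sys.executable` keeps the install tied to the same Python environment that runs the script, which avoids installing into a different interpreter by mistake.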
- Ensure Python is installed.
- Install Tesseract OCR:
  - Windows:
    - Download the installer from https://github.com/UB-Mannheim/tesseract/wiki.
    - Run the installer and follow the installation instructions.
    - Add the Tesseract installation directory to the system's PATH environment variable.
  - Linux:
    - Use your package manager to install Tesseract OCR. For example, on Ubuntu: `sudo apt-get update && sudo apt-get install tesseract-ocr`
  - macOS:
    - Install Tesseract OCR using Homebrew: `brew install tesseract`
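
After installing, it can help to confirm that Tesseract is actually reachable on PATH before running the script. The check below is a small sketch using only the standard library; the helper names `tesseract_path` and `tesseract_version` are hypothetical and not part of this project.

```python
import shutil
import subprocess


def tesseract_path():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`, or an empty string."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


path = tesseract_path()
if path is None:
    print("Tesseract was not found on PATH; check the installation steps above.")
else:
    print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If Tesseract is installed but cannot be added to PATH, `pytesseract` also allows setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.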
- Clone or download the repository.
- Place images to be processed in the `images` folder.
- Run the script (`main.py`).
- Follow the prompts to input image and output folder paths.
- View the generated Excel files in the `output` folder.
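
For orientation, the workflow above (read images, run OCR, parse, write Excel) might look roughly like the sketch below. It is not the contents of `main.py`. The line format it parses (a contact name followed by a time such as `10:45 AM`), the column names, the accepted image extensions, and the function names `parse_contacts` and `process_folder` are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed OCR line format: "<contact name> <H:MM or HH:MM> [AM|PM]".
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<time>\d{1,2}:\d{2}(?:\s?[AP]M)?)$", re.IGNORECASE
)


def parse_contacts(text):
    """Return one row per OCR line that matches the assumed name/time format."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append({"Contact Name": match["name"], "Time Seen": match["time"]})
    return rows


def process_folder(image_dir, output_dir):
    """Run OCR on each image in `image_dir` and write one Excel file per image."""
    # Third-party imports are local so parse_contacts works without them installed.
    import pandas as pd
    import pytesseract
    from PIL import Image

    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        text = pytesseract.image_to_string(Image.open(image_path))
        frame = pd.DataFrame(parse_contacts(text), columns=["Contact Name", "Time Seen"])
        frame.to_excel(output / f"{image_path.stem}.xlsx", index=False)


# Example call, mirroring the folder names used in the steps above:
# process_folder("images", "output")
```

Keeping the parser separate from the OCR call makes it easy to adjust the pattern to the actual layout of your screenshots without touching the file handling.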
- Python 3.x
- Tesseract OCR
- Required Python modules: `pytesseract`, `openpyxl`, `pandas`
This project is licensed under the MIT License.