thePortus / arabic-ocr

DH Demonstration using Python to OCR PDFs of Arabic Text

arabic-ocr

A Digital Humanities Exercise: Batch converting folders of Arabic PDFs to plaintext files.

David J. Thomas, thePort.us

Run In Your Browser

This exercise uses a Jupyter Notebook to demonstrate converting Arabic PDFs to plain text files. Rather than installing locally, you can see this notebook in action in your browser at MyBinder. Once the server has launched (it takes a while; ignore any warnings), choose arabic-ocr.ipynb from the navigation menu on the left. To run code, click inside a code cell, then click the play button to run that cell.

To Install Locally

Open your command prompt, navigate to your desired parent folder, and enter

git clone https://github.com/thePortus/arabic-ocr.git
cd arabic-ocr
pip install -r requirements.txt

You also need to install poppler, the software library that the notebook relies on to read PDFs.

  • On macOS (OS X) machines, open your terminal and enter brew install poppler
  • On Linux machines (Debian/Ubuntu), open your terminal and enter sudo apt-get install poppler-utils
  • Windows is a bit harder as poppler wasn't originally meant for it, but you can find a Windows downloadable installer here along with some discussion from a forum about how to set it up here.
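Before launching the notebook, it can help to confirm that poppler is actually reachable from Python. The snippet below is a minimal sketch, not part of the repository: it assumes that the poppler command-line tools pdftoppm and pdfinfo are what need to be on your PATH, and the function name find_poppler_tools is made up for illustration.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each poppler command-line tool to its full path, or None if not found.

    Assumption: these two executables are the ones shipped by poppler /
    poppler-utils that a PDF-to-image step would call.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        status = path if path else "NOT FOUND (poppler may be missing from your PATH)"
        print(f"{tool}: {status}")
```

If either tool is reported as not found, revisit the install step for your operating system; on Windows this usually means adding poppler's bin folder to your PATH.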

To launch the notebook, enter jupyter notebook in your terminal.

Once the interface launches in your web browser, select arabic-ocr.ipynb and get started.
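For orientation, the kind of batch conversion this exercise demonstrates (a folder of Arabic PDFs in, one plain text file per PDF out) can be sketched roughly as below. This is an illustrative sketch, not the notebook's actual code: it assumes the pdf2image and pytesseract packages, plus a Tesseract install with the Arabic language data (ara), none of which is confirmed by this README. The function names plan_outputs, ocr_pdf, and convert_folder are hypothetical.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Pair every PDF in input_dir with the .txt path it should be written to."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (pdf, output_dir / (pdf.stem + ".txt"))
        for pdf in sorted(input_dir.glob("*.pdf"))
    ]


def ocr_pdf(pdf_path, lang="ara"):
    """OCR one PDF page by page and return the combined text.

    Assumption: pdf2image (which needs poppler) renders the pages and
    pytesseract (which needs Tesseract plus Arabic data) reads them.
    """
    # Imported lazily so the helpers above work without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path))
    return "\n\n".join(
        pytesseract.image_to_string(page, lang=lang) for page in pages
    )


def convert_folder(input_dir, output_dir, ocr=ocr_pdf):
    """Write one UTF-8 text file per PDF and return the paths written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf, txt in plan_outputs(input_dir, output_dir):
        # UTF-8 is specified explicitly so Arabic text is saved correctly
        # regardless of the platform's default encoding.
        txt.write_text(ocr(pdf), encoding="utf-8")
        written.append(txt)
    return written
```

A call such as convert_folder("pdfs", "texts") would then process every PDF in a (hypothetical) pdfs folder. Passing the OCR step in as the ocr argument keeps the folder handling separate from the OCR engine, so either part can be swapped or tested on its own.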

License:MIT License
