nautilus-ocr

nautilus-ocr is a Nautilus script that scans pdfs and adds OCR information to them. This is useful if you scanned documents and want to make them searcheable or copy-paste content from them.

It allows selecting the language in the PDF for a better recognition of the text.

Requirements

Zenity: UI dialogs
OCRMyPDF: OCR
Tesseract: OCR (low-level)

For Tesseract you need also the OSD data.

You can install all the dependencies in Fedora as follows:

$ dnf install tesseract tesseract-osd ocrmypdf zenity

In Ubuntu you would instead do:

$ sudo apt install tesseract-ocr tesseract-ocr-osd ocrmypdf zenity

Similar packages might exist for your distribution of choice.

You might also want to add training data for other languages. In Fedora you can install, for example, French language as follows:

$ dnf install tesseract-langpack-fra

In Ubuntu, it would be:

$ sudo apt install tesseract-ocr-fra

Installation

(Click on image to see screencast)

Download nautilus-ocr.sh and move it to the Nautilus Scripts folder. Remember to add the execution permission. If you downloaded the script to ~/Downloads folder:

$ chmod +x ~/Downloads/nautilus-ocr.sh
$ mv ~/Downloads/nautilus-ocr.sh "~/.local/share/nautilus/scripts/Create OCR'ed PDF"

Once the script is in place, you can righ-click on any file in Nautilus.

It will create a pdf file with the _ocr suffix. That file will contain OCR information. If you can select text in that file, it means the OCR worked.

Configuration

Language

By default a dialog will ask you which language to use for the OCR. If you want to use always the same language, edit nautilus-ocr.sh and set the language at the top of the file.

Close on finish

A dialog will be showed after OCR'ing the files, stating that the process has finished. If you want to close the dialog automatically, just uncomment the option --auto-close.

Contributors

@errotu: Ubuntu support

daniperez / nautilus-ocr