g-schmitz / workflow_ocr

This is a Nextcloud Workflow App which enables you to process files via OCR on serverside.

Nextcloud Workflow OCR app

Setup

App installation

First, install the Nextcloud Workflow OCR app from the official Nextcloud app store, or download the appropriate tarball from the releases page and extract it into your apps folder:

cd /var/www/<NEXTCLOUD_INSTALL>/apps
wget https://github.com/R0Wi/workflow_ocr/releases/download/<VERSION>/workflow_ocr.tar.gz
tar -xzvf workflow_ocr.tar.gz
rm workflow_ocr.tar.gz

Nextcloud background jobs

Since the actual processing of the files is done asynchronously via Nextcloud's background job engine, make sure you've properly set up the cron functionality as described here. If possible, please use the crontab approach for more reliability.
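For the crontab approach, the Nextcloud admin documentation suggests running cron.php every five minutes as the web server user. The following is a minimal sketch that prints such an entry. The function name nextcloud_cron_line and the installation path are illustrative assumptions, not part of this app.

```shell
# Print a crontab entry that runs Nextcloud's cron.php every 5 minutes.
# $1: Nextcloud installation directory (assumed path, adjust to your system)
nextcloud_cron_line() {
    printf '*/5 * * * * php -f %s/cron.php\n' "$1"
}

nextcloud_cron_line /var/www/nextcloud
```

The printed line would go into the crontab of the web server user (for example via crontab -u www-data -e). Running sudo -u www-data php occ background:cron afterwards tells Nextcloud to rely on the system cron.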

Backend

⚠️ Since v1.20.1 you'll have to install OCRmyPDF.

In the backend, OCRmyPDF is used for processing PDF files. Make sure you have this command-line tool installed in the appropriate version (see "Used libraries & components" below).

apt-get install ocrmypdf

The ocrmypdf CLI can also convert single image files (jpg/png) to PDF before processing them via OCR. This mode is also supported by this app. You can read more about it in the official docs.

Also, if you want to use specific language settings, please install the corresponding tesseract packages.

# English
apt-get install tesseract-ocr-eng

# German
apt-get install tesseract-ocr-deu

# Chinese - Simplified
apt-get install tesseract-ocr-chi-sim
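To check whether a language pack is actually visible to Tesseract after installation, the output of tesseract --list-langs can be searched for the language code. A small sketch follows. The helper name has_tesseract_lang is hypothetical, and it only inspects text passed on stdin.

```shell
# Succeed if the language code in $1 appears as a whole line on stdin.
# Intended to be fed with the output of `tesseract --list-langs`.
has_tesseract_lang() {
    grep -qxF -- "$1"
}

# Example (requires tesseract to be installed):
#   tesseract --list-langs 2>&1 | has_tesseract_lang deu && echo "German is available"
```

The 2>&1 redirection is there because some older Tesseract releases print the language list to stderr.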

Usage

You can configure the OCR processing via Nextcloud's workflow engine. To do so, create a new flow via Settings → Flow → Add new flow. If you don't see OCR file there, the app isn't installed properly or you forgot to activate it.

Usage setup

Useful triggers

Trigger OCR if file was created or updated

If you want a newly uploaded file to be processed via OCR or if you want to process a file which was updated, use the When-conditions File created or File updated or both.

A typical setup for processing incoming PDF files and adding a text layer to them might look like this:

PDF setup

⚠️ Please make sure to use the condition File MIME type is PDF documents, otherwise you might not be able to save the workflow, as discussed here.

Trigger OCR on tag assigning

If you have existing files which you want to process after they have been created, or if you want to filter manually which files are processed, you can use the Tag assigned event to trigger the OCR process if a user adds a specific tag to a file. Such a setup might look like this:

Tag assigned setup

After that you should be able to add a file to the OCR processing queue by assigning the configured tag to a file:

Tag assign frontend 1

Tag assign frontend 2

Settings

Per workflow settings

Anyone who can create new workflows (admin or regular user) can configure settings for the OCR processing for a specific workflow. These settings are only applied to the specific workflow and do not affect other workflows.

Per workflow settings

Currently the following settings are available per workflow:

| Name | Description |
| --- | --- |
| Languages | The languages to be used for OCR processing. The languages can be chosen from a dropdown list. For PDF files this setting corresponds to the -l parameter of ocrmypdf. Please note that you'll have to install the appropriate languages as described in the ocrmypdf documentation. |
| Remove background | If the switch is set, the OCR processor will try to remove the background of the document before processing and set a white background instead. For PDF files this setting corresponds to the --remove-background parameter of ocrmypdf. |
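To illustrate how these two settings translate into ocrmypdf parameters, here is a rough sketch that assembles the corresponding arguments. The function workflow_ocr_args is hypothetical and does not reflect how the app builds its command internally. It only relies on the documented -l and --remove-background options and on ocrmypdf accepting several languages joined by "+".

```shell
# Build ocrmypdf arguments from per-workflow settings.
# $1: "true" to remove the background, anything else to keep it
# $2...: Tesseract language codes (e.g. deu eng); may be omitted
workflow_ocr_args() {
    wo_remove_bg="$1"
    shift
    wo_args=""
    if [ "$#" -gt 0 ]; then
        # Several languages are joined by "+", e.g. "-l deu+eng"
        wo_args="-l $(IFS=+; printf '%s' "$*")"
    fi
    if [ "$wo_remove_bg" = "true" ]; then
        wo_args="${wo_args:+$wo_args }--remove-background"
    fi
    printf '%s\n' "$wo_args"
}

workflow_ocr_args true deu eng   # prints: -l deu+eng --remove-background
```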

Global settings

As a Nextcloud administrator you're able to configure global settings which apply to all configured OCR workflows on the current system. Go to Settings → Flow and scroll down to Workflow OCR:

Global settings

Currently the following settings can be applied globally:

| Name | Description |
| --- | --- |
| Processor cores | Defines the number of processor cores to use for OCR processing. When the input is a PDF file, this corresponds to the ocrmypdf CPU limit. This setting can be especially useful if you have a small backend system which has only limited power. |
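On the ocrmypdf command line, the CPU limit is controlled with the -j / --jobs option. The sketch below shows how a configured core count could be turned into that argument. The helper cores_arg is hypothetical, and treating an empty or zero value as "no limit" is an assumption made for this example.

```shell
# Print the ocrmypdf job-limit argument for a configured core count.
# $1: configured number of processor cores (empty or 0 = no explicit limit)
cores_arg() {
    case "$1" in
        ''|0|*[!0-9]*) ;;                          # unset, zero or non-numeric: print nothing
        *) printf '%s %s\n' --jobs "$1" ;;
    esac
}

cores_arg 2   # prints: --jobs 2
```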

Testing your configuration

To test if your file gets processed properly you can do the following steps:

  1. Upload a new file which meets the criteria you've recently defined in the workflow creation.
  2. Go to your server's console and change into the Nextcloud installation directory (e.g. cd /var/www/html/nextcloud).
  3. Execute the cron job manually, e.g. by typing sudo -u www-data php cron.php (this is the command you usually set up to be executed by the Linux crontab).
  4. If everything went fine you should see that there was a new version of your file created. If you uploaded a PDF file you should now be able to select text in it if it contained at least one image with scanned text.
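If you prefer to verify the result on the command line, one option is to extract the text of the processed PDF with pdftotext (part of poppler-utils, not a dependency of this app) and check that anything came out. A minimal sketch, with has_text as a hypothetical helper and scanned.pdf as a placeholder file name:

```shell
# Succeed if stdin contains at least one non-whitespace character.
# Intended for the output of `pdftotext <file> -`.
has_text() {
    grep -q '[^[:space:]]'
}

# Example (requires pdftotext and a local copy of the processed file):
#   pdftotext scanned.pdf - | has_text && echo "text layer found"
```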

File versions

How it works

General

General diagram

PDF

For processing PDF files, the external command line tool OCRmyPDF is used. The tool is always invoked with the --skip-text parameter so that it will skip pages which already contain text. Please note that with that parameter set, it's currently not possible to analyze pages with mixed content (see R0Wi-DEV#113 for further information).

Images

For processing single images (currently jpg and png are supported), ocrmypdf converts the image to a PDF. The converted PDF file will then be OCR processed and saved as a new file with the original filename and the extension .pdf (for example myImage.jpg will be saved to myImage.jpg.pdf). The original image file will remain untouched.
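The naming rule can be sketched as follows. Note that the app decides based on the MIME type; matching on the (lowercase) file extension and the function name ocr_target_name are simplifications assumed for this example.

```shell
# Print the name under which the OCR result of a file is stored.
# $1: original file name
ocr_target_name() {
    case "$1" in
        *.jpg|*.jpeg|*.png) printf '%s.pdf\n' "$1" ;;   # image: new PDF next to the original
        *)                  printf '%s\n' "$1" ;;       # e.g. PDF: same name, new file version
    esac
}

ocr_target_name myImage.jpg   # prints: myImage.jpg.pdf
```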

Development

Dev setup

Tools and packages you need for development:

You can then build and install the app by cloning this repository into the Nextcloud apps folder and running make build.

cd /var/www/<NEXTCLOUD_INSTALL>/apps
git clone https://github.com/R0Wi/workflow_ocr.git workflow_ocr
cd workflow_ocr
make build

Don't forget to activate the app via the Nextcloud web GUI.

Debugging

We provide a preconfigured debug configuration file for VSCode at .vscode/launch.json which will automatically be recognized when opening this repository inside of VSCode. If you've properly installed and configured the XDebug-plugin you should be able to see it in the upper left corner when being inside of the debug-tab.

VSCode debug profile

To get the debugger profiles working you need to ensure that XDebug for Apache (or your preferred webserver) and XDebug for PHP CLI both connect to your machine at port 9003. Depending on your system a possible configuration could look like this:

; /etc/php/7.4/cli/php.ini
; ...
; Note: the xdebug.remote_* names below are Xdebug 2 settings. Xdebug 3 uses
; xdebug.mode, xdebug.client_host, xdebug.client_port and
; xdebug.start_with_request instead.
[Xdebug]
zend_extension=/usr/lib/php/20190902/xdebug.so
xdebug.remote_enable=1
xdebug.remote_host=127.0.0.1
xdebug.remote_port=9003
xdebug.remote_autostart=1

; /etc/php/7.4/apache2/php.ini
; ...
[Xdebug]
zend_extension=/usr/lib/php/20190902/xdebug.so
xdebug.remote_enable=1
xdebug.remote_host=127.0.0.1
xdebug.remote_port=9003
xdebug.remote_autostart=1

The following table lists the various debug profiles:

| Profile name | Use |
| --- | --- |
| Listen for XDebug | Starts XDebug listener for your webserver process. |
| Listen for XDebug (CLI) | Starts XDebug listener for your PHP CLI process. |
| Run cron.php | Runs Nextcloud's cron.php with debugger attached. Useful for debugging OCR-processing jobs. |
| Debug Unittests | Starts PHPUnit unit tests with debugger attached. |
| Debug Integrationtests | Starts PHPUnit integration tests with debugger attached. |

If you're looking for some good sources on how to set up VSCode + XDebug we can recommend:

Docker-based setup

If you're interested in a Docker-based setup we can recommend the images from https://github.com/thecodingmachine/docker-images-php which already come with Apache and XDebug installed.

A working docker-compose.yml-file could look like this:

version: '3'
services:
  apache_dev:
    restart: always
    container_name: apache_dev
    image: ${IMAGE}-custom
    build:
      dockerfile: ./Dockerfile
      args:
        IMAGE: ${IMAGE}
    environment:
      - PHP_INI_MEMORY_LIMIT=1g
      - PHP_INI_ERROR_REPORTING=E_ALL
      - PHP_INI_XDEBUG__START_WITH_REQUEST=yes
      - PHP_INI_XDEBUG__LOG_LEVEL=7
      - PHP_EXTENSIONS=xdebug gd intl bcmath gmp imagick
    volumes:
      - ./html:/var/www/html
      - ./000-default.conf:/etc/apache2/sites-enabled/000-default.conf
    ports:
      - 80:80
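    networks:
      - web_dev

# Declare the network referenced by the service; Compose rejects undefined networks
networks:
  web_dev: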

IMAGE could be set to IMAGE=thecodingmachine/php:7.4-v4-apache-node14 and the content of Dockerfile might look like this:

ARG IMAGE
FROM $IMAGE

USER root
RUN    apt-get update \
    && apt-get install -y make ocrmypdf tesseract-ocr-eng tesseract-ocr-deu smbclient \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* /usr/share/doc/*
USER docker

ℹ️ Please note that these are just working snippets which you might have to modify to fit your needs.

Executing tests

To execute the implemented PHPUnit tests you can use one of the following commands:

# Only run unittests
make unittest

# Only run integrationtests
make integrationtest

# Run all tests
make test

# Run all tests and create HTML coverage report
make html-coverage

# Run all tests and create XML coverage report
make coverage

⚠️ Make sure you activated the app before you run any tests (php occ app:enable workflow_ocr). Otherwise the initialization will fail.

Adding a new OcrProcessor

To support a new mimetype for being processed with OCR you have to follow a few easy steps:

  1. Create a new class in lib/OcrProcessors and let the class implement the interface IOcrProcessor.
  2. Register your implementation in lib/OcrProcessors/OcrProcessorFactory.php by adding it to the mapping.

     private static $mapping = [
         'application/pdf' => PdfOcrProcessor::class,
         // Add your class here, for example:
         'mymimetype' => MyOcrProcessor::class
     ];

  3. Register a factory for creating your newly added processor in lib/OcrProcessors/OcrProcessorFactory.php by adding an appropriate function inside of registerOcrProcessors.

     public static function registerOcrProcessors(IRegistrationContext $context) : void {
         // [...]
         $context->registerService(MyOcrProcessor::class, function(ContainerInterface $c) {
             // Your factory goes here, e.g. resolve constructor dependencies from $c
             return new MyOcrProcessor(/* ... */);
         }, false);
     }

That's all. If you now create a new workflow based on your added mimetype, your implementation should be triggered by the app. The return value of ocrFile(string $fileContent, WorkflowSettings $settings, GlobalSettings $globalSettings) will be interpreted as the file content of the scanned file. This one is used to create a new file version in Nextcloud.

Limitations

  • Currently only PDF documents (application/pdf) and single images (image/jpeg and image/png) can be used as input. Other mimetypes are currently ignored but might be added in the future.

  • All input file types currently produce a single PDF output file. No other output file format is supported at the moment.

  • PDF metadata (like author, comments, ...) might not be available in the converted output PDF document. This is limited by the capabilities of ocrmypdf (see ocrmypdf/OCRmyPDF#327).

  • Currently files are only processed based on workflow events, so there is no batch mechanism for applying OCR to already existing files. This is a feature which might be added in the future. For applying OCR to a single file which already exists, one could use the "tag assigned" workflow trigger.

  • If you encounter any problems with the OCR processing, you can always restore the original file via Nextcloud's version history.

    File versions

    If you want to clean the version history of all files and only preserve the newest file version, you can use
    sudo -u www-data php occ versions:cleanup

    Read more about this in the docs.

Used libraries & components

| Name | Version | Link |
| --- | --- | --- |
| OCRmyPDF (commandline) | >= 9.6.0 | https://github.com/jbarlow83/OCRmyPDF (on Debian, you might need to manually install a more recent version as described in https://ocrmypdf.readthedocs.io/en/latest/installation.html#ubuntu-18-04-lts; see R0Wi-DEV#46) |
| php-shellcommand | >= 1.6 | https://github.com/mikehaertl/php-shellcommand |
| chain | >= 0.9.0 | https://packagist.org/packages/cocur/chain |
| PHPUnit | >= 8.0 | https://phpunit.de/ |

About

License: GNU Affero General Public License v3.0
