documentscanner

documentscanner turns (almost) any ADF scanner into a document scanner that produces OCRed PDFs. All you need is:

a SANE-compatible ADF scanner
a Raspberry Pi
(optional) a more powerful host to run the OCR tasks

Setup instructions

Clone documentscanner onto the Raspberry Pi: $ git clone https://github.com/BastianPoe/documentscanner.git ; cd documentscanner
Install SANE and other dependencies: $ apt-get install sane sane-utils bash unpaper tesseract-ocr tesseract-ocr-deu imagemagick bc poppler-utils findutils scanbd
Install scanbd script: $ mkdir -p /etc/scanbd/scripts ; cp scanbd/test.script /etc/scanbd/scripts/
Enable scanbd: $ systemctl enable scanbd
Restart scanbd: $ systemctl restart scanbd
Create inbox and outbox: $ mkdir -p /inbox /outbox
Start document processor: $ cd scripts ; ./process.sh /inbox /outbox
Done

What if it does not work

Check if sane recognizes your scanner via $ scanimage -L
Check the logs of scanbd via $ journalctl -f. You should see log output whenever you press a button on the scanner
Adjust the button events that /etc/scanbd/scripts/test.script reacts to (currently: scan and email)
Check if scanned raw documents end up in /inbox
Check the log files of the processor
Check if PDFs end up in /outbox
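
Before walking through the checks above, it can help to confirm that the command-line tools the pipeline relies on are actually installed. The following is a minimal sketch, not part of the project; the function name missing_tools is hypothetical, and the tool list in the usage note is taken from the install step and the pipeline description.

```shell
#!/bin/sh
# Hypothetical helper, not part of documentscanner.
# Prints the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Usage: $ missing_tools scanimage unpaper tesseract identify pdfunite bc. Empty output means all of them were found.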

How it works

Scanning

documentscanner uses scanbd to wait for someone to press a button on the scanner. A button press triggers the script /etc/scanbd/scripts/test.script, which determines which button was pressed. The script then calls /home/pi/documentscanner/scripts/scan.sh, which scans all available pages into a folder in /inbox. Once the scan has completed, a file called complete is placed in the scan directory.
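
The scan step described above could look roughly like the sketch below. This is not the project's scan.sh. The function name scan_document, the SCAN_CMD override, the directory naming scheme, the resolution, and the --source value are all assumptions; source names differ between scanner backends. The sketch also assumes that scanimage exits with status zero when the feeder runs empty after at least one page, which is worth verifying on your device.

```shell
#!/bin/sh
# Hypothetical sketch of a scan step; not the project's scan.sh.
# SCAN_CMD is an assumed override so the scanner command can be swapped out.
SCAN_CMD="${SCAN_CMD:-scanimage}"

scan_document() {
    inbox="$1"
    # One directory per scan; the timestamp/PID naming is an assumption.
    dir="$inbox/scan_$(date +%Y%m%d_%H%M%S)_$$"
    mkdir -p "$dir" || return 1

    # --batch keeps scanning until the feeder is empty; %04d is the page counter.
    # --source and --resolution are device-specific options (assumed values).
    if "$SCAN_CMD" --batch="$dir/page_%04d.pnm" --format=pnm \
                   --source "ADF" --resolution 300; then
        # Only a scan that finished cleanly receives the flag file.
        touch "$dir/complete"
    else
        echo "scan aborted, leaving $dir without the complete flag" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}
```

Usage: $ scan_document /inbox. Writing the flag only after the scanner command succeeds is what keeps half-finished scans away from the processor.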

PDF conversion

The processor checks /inbox every 10 s and processes any new document that carries the complete flag. First, identify is used with a heuristic to detect and remove empty pages. Then each page is cleaned up with unpaper (background removal, etc.). Next, the pages are OCRed with tesseract and converted to PDFs. Finally, the individual PDFs are joined into one using pdfunite and the scan directory is deleted.
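
The pipeline above can be sketched as follows. This is not the project's process.sh. All function names are hypothetical, and the blank-page test (normalized standard deviation of the pixel values below 0.02) is an assumed stand-in for the heuristic the project actually uses. The sketch further assumes that pages arrive as .pnm files and that German is the OCR language, matching the tesseract-ocr-deu package from the install step.

```shell
#!/bin/sh
# Hypothetical sketch of the processing loop; not the project's process.sh.
BLANK_THRESHOLD="${BLANK_THRESHOLD:-0.02}"   # assumed value, tune per scanner

# Succeeds if the given standard deviation is below the threshold.
is_blank() {
    awk -v sd="$1" -v t="$BLANK_THRESHOLD" 'BEGIN { exit !(sd + 0 < t + 0) }'
}

# Print every scan directory in the inbox that carries the complete flag.
ready_scans() {
    for dir in "$1"/*/; do
        [ -f "${dir}complete" ] && printf '%s\n' "${dir%/}"
    done
    return 0
}

# Turn one scan directory into a single OCRed PDF in the outbox.
process_scan() {
    dir="$1"; outbox="$2"
    name=$(basename "$dir")
    work=$(mktemp -d) || return 1
    set --                                   # "$@" collects the per-page PDFs
    for page in "$dir"/*.pnm; do
        [ -f "$page" ] || continue
        base="$work/$(basename "$page" .pnm)"
        sd=$(identify -format '%[fx:standard_deviation]' "$page") || return 1
        is_blank "$sd" && continue           # drop empty pages
        unpaper "$page" "$base.clean.pnm" || return 1
        # The trailing "pdf" selects PDF output; tesseract writes "$base.pdf".
        tesseract "$base.clean.pnm" "$base" -l deu pdf || return 1
        set -- "$@" "$base.pdf"
    done
    case $# in
        0) echo "no non-blank pages in $dir" >&2 ;;
        1) cp "$1" "$outbox/$name.pdf" || return 1 ;;
        *) pdfunite "$@" "$outbox/$name.pdf" || return 1 ;;
    esac
    rm -rf "$work" "$dir"                    # scan directory is deleted last
}

# Poll the inbox every 10 seconds, as described above.
watch_inbox() {
    while true; do
        ready_scans "$1" | while IFS= read -r scan; do
            process_scan "$scan" "$2" </dev/null
        done
        sleep 10
    done
}
```

Usage: $ watch_inbox /inbox /outbox. Because every failing step returns early, a scan directory is only removed after its PDF has been written.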

Maintenance required

Incomplete scans (e.g. those where the ADF pulled multiple pages at once) are aborted and never receive the complete flag, so the processor does not pick them up. Check /inbox from time to time to see which documents have been left there and delete them.
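
A small helper can make this periodic check easier. The sketch below is not part of the project; the function name stale_scans and the default age of 60 minutes are assumptions. It lists scan directories that lack the complete flag and have not been modified recently, so that a scan still in progress is not mistaken for an aborted one. It relies on the -mindepth, -maxdepth and -mmin options of GNU find (installed via findutils above).

```shell
#!/bin/sh
# Hypothetical helper, not part of documentscanner.
# Lists scan directories without the complete flag that were last
# modified more than max_age_min minutes ago (default: 60, assumed).
stale_scans() {
    inbox="$1"
    max_age_min="${2:-60}"
    find "$inbox" -mindepth 1 -maxdepth 1 -type d -mmin "+$max_age_min" |
    while IFS= read -r dir; do
        [ -e "$dir/complete" ] || printf '%s\n' "$dir"
    done
}
```

Usage: $ stale_scans /inbox 120. The helper only prints candidates; review the list before deleting anything.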

(Optional) Speed up PDF generation

I run the processor in a Docker container on my Synology NAS. This is much faster than running it on the Raspberry Pi and does not slow down subsequent scans. The required setup steps are quite easy:

Create a new shared directory on your NAS and expose it via NFS to your Raspberry Pi
Install autofs: $ apt-get install autofs
Add NFS mounting to /etc/auto.misc: documentarchive -rw,soft,intr,rsize=8192,wsize=8192 192.168.1.26:/volume1/documentarchive
Enable auto.misc by adding the following line to /etc/auto.master: /misc /etc/auto.misc
Edit your /etc/scanbd/scripts/test.script to place scans into a folder on the NFS share, e.g. FOLDER="/misc/documentarchive/scans_raw"
Pull bastianpoe/document_archive into the Docker Station on your NAS
Map /inbox onto the NFS share created above and /outbox onto where the PDFs shall be stored
Start the docker container
Done
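
On a host with the docker command line available, the last three steps could be sketched as follows. The function name start_processor, the container name, the restart policy, and the DOCKER override are assumptions. The sketch also assumes that the image starts the processor on /inbox and /outbox by default, as the mapping step suggests.

```shell
#!/bin/sh
# Hypothetical sketch; container name and restart policy are assumptions.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the command.
DOCKER="${DOCKER:-docker}"

start_processor() {
    inbox_dir="$1"    # host path of the NFS-exported raw scans
    outbox_dir="$2"   # host path where the finished PDFs should be stored
    "$DOCKER" run -d \
        --name documentscanner-processor \
        --restart unless-stopped \
        -v "$inbox_dir:/inbox" \
        -v "$outbox_dir:/outbox" \
        bastianpoe/document_archive
}
```

Usage: $ start_processor /volume1/documentarchive/scans_raw /volume1/documentarchive/pdf. The first path follows the share and folder used in the steps above; the second path is a made-up example for the PDF destination.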
