Script to run OCR on PDF files. It can be run periodically to scan a directory for newly scanned PDF files. At present it performs OCR for German only.
/var/log/ocr/: directory where log output is stored.
- pdfsandwich
- tesseract-ocr-deu
- awk
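A quick way to confirm that the tools listed above are installed is sketched below. This is not part of the script itself; the helper name `missing_commands` is hypothetical, and it assumes that the tesseract-ocr-deu package provides language data for a `tesseract` binary on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of the OCR script): print every command
# from the argument list that cannot be found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# pdfsandwich and awk come from the list above; "tesseract" is assumed to be
# the binary that uses the German language data from tesseract-ocr-deu.
missing=$(missing_commands pdfsandwich tesseract awk)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing dependencies:" $missing >&2
fi
```

If tesseract is installed, `tesseract --list-langs` should show `deu` among the available languages.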
The script can process either all files in a directory or a single file.
- List files ending with PDF or pdf which are owned by the user "scanner"
- Iterate over the files and do OCR
- Move processed files to the subdirectory 'processed' of the scanned directory
- Move the original file to the subdirectory 'archive' of the scanned directory
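The steps above could look roughly like the following sketch. It is an illustration under stated assumptions, not the script's actual code: the function names `ocr_file` and `process_dir` are hypothetical, the OCR output is written straight into 'processed' rather than moved there afterwards, and the file match assumes the names end in `.pdf` or `.PDF`. It relies on pdfsandwich's `-lang` and `-o` options and on a find that supports `-maxdepth` (GNU and BSD find do).

```shell
#!/bin/sh
# Illustrative sketch only; function names are hypothetical.

# OCR a single file ($1) with German language data, writing the result to $2.
ocr_file() {
    pdfsandwich -lang deu -o "$2" "$1"
}

# Process every matching PDF directly inside directory $1.
# $2 is the owning user to filter on (defaults to "scanner").
process_dir() {
    dir=$1
    owner=${2:-scanner}
    mkdir -p "$dir/processed" "$dir/archive"

    # -maxdepth 1 keeps find out of 'processed' and 'archive', so files
    # that were already handled are not picked up again.
    find "$dir" -maxdepth 1 -type f -user "$owner" \
        \( -name '*.pdf' -o -name '*.PDF' \) |
    while IFS= read -r file; do
        name=$(basename "$file")
        if ocr_file "$file" "$dir/processed/$name"; then
            # Archive the original only after OCR succeeded.
            mv "$file" "$dir/archive/$name"
        else
            echo "OCR failed for $file" >&2
        fi
    done
}
```

With these definitions, `process_dir /path/to/scans` would handle a whole directory, where `/path/to/scans` is a placeholder. Files whose OCR fails stay in place, so a later run can retry them.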