R0Wi-DEV / workflow_ocr

This is a Nextcloud workflow app that lets you process files via OCR on the server side.

Debian Buster: ocrmypdf outdated

joergmschulz opened this issue

This wonderful tool possibly can't be used on Debian Buster. Buster ships ocrmypdf 8.0.1, which issues warnings like

WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
When this warning occurs, the PDF is left unchanged and no text layer is added.

Version 9.8.1 from Alpine works perfectly with the same input file.
Nextcloud log:

OCR for file /joerg.schulz/files/FDS Bau - Sanierung Haus Sonnenblick/Bauherr/Dokumentationen/Projektantrag-SoftwareAG2014.pdf not possible. Message: OCRmyPDF exited abnormally with exit-code 0. Message: WARNING - 4: [tesseract] lots of diacritics - possibly poor OCR WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR

Hi @joergmschulz, it's a shame that you're experiencing issues with OCRmyPDF. First of all, we'd like to mention that we are not willing to replace the tool under the hood. We tried other tools and packages earlier to achieve the same result, which led to even more problems. Generally speaking, we found that OCRmyPDF is the best tool for exactly what this app is meant to do with PDF files.

In your case I see the following options:

  1. Try installing a newer version of OCRmyPDF from outside the regular package source. You could try the Python installation mentioned here for Ubuntu 18.04.
  2. Our app could simply ignore warnings in general. Although I consider that quite dangerous, you could check whether you can process the mentioned file "by hand" by invoking ocrmypdf input.pdf output.pdf on the command line. If it only prints some warnings but the output is generated properly, we could think about more fault-tolerant handling inside the app. Please give us some feedback on this, or attach the mentioned PDF file if possible.
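
To make the second option more concrete, here is a rough sketch of what "fault tolerant" could mean when testing by hand: treat a run as successful when the exit code is 0 and a non-empty output file exists, and only report whatever was printed on stderr. The helper name run_ocr_tolerant is made up for this illustration and is not part of the app, which may handle this differently.

```shell
# Hypothetical helper (not part of workflow_ocr): run an OCR command and judge
# success by its exit code plus a non-empty output file, not by stderr content.
# Usage: run_ocr_tolerant OUTPUT_FILE COMMAND [ARGS...]
run_ocr_tolerant() {
    output_file=$1
    shift
    stderr_log=$(mktemp) || return 2

    # Run the given command, capturing stderr so that warnings can be
    # reported without being treated as a failure.
    "$@" 2>"$stderr_log"
    status=$?

    if [ "$status" -eq 0 ] && [ -s "$output_file" ]; then
        # Exit code 0 and a non-empty result: pass warnings along, but succeed.
        if [ -s "$stderr_log" ]; then
            echo "OCR succeeded with messages on stderr:"
            cat "$stderr_log"
        fi
        rm -f "$stderr_log"
        return 0
    fi

    # Non-zero exit code, or no usable output file: treat as a real failure.
    echo "OCR failed (exit code $status, or output file missing/empty)" >&2
    cat "$stderr_log" >&2
    rm -f "$stderr_log"
    return 1
}

# Example call, assuming ocrmypdf is on the PATH:
# run_ocr_tolerant output.pdf ocrmypdf input.pdf output.pdf
```

Note that this sketch does not notice a stale output file left over from an earlier run, so the output path should not exist beforehand.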

@bahnwaerter anything to add on this?

Btw.: I'm also using Debian Buster with OCRmyPDF 8.0.1 installed and I have not seen similar errors. So it might also be related to the PDF files you want to process.

Attaching a file here doesn't work, but it is available at: https://cloud.faudin.de/s/e2AbYxXR9njcZRC , password dddjjj

messages:

ocrmypdf --force-ocr /data/nc/joerg.schulz/files/Documents/Pferdezüchter\ Jens.pdf /tmp/ppp.pdf 
   INFO - Optimize ratio: 1.14 savings: 12.1%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ ocrmypdf --version
8.0.1+dfsg
www-data@c:~$ ocrmypdf  /data/nc/joerg.schulz/files/Documents/Projektantrag_SoftwareAG_2014.pdf /tmp/ppp.pdf 
WARNING -    4: [tesseract] lots of diacritics - possibly poor OCR
WARNING -    2: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ php -f /var/www/nc/cron.php

Confirming: when I install the OCRmyPDF version as documented by @R0Wi above, everything works perfectly. Maybe that should go into the README?

Glad to hear that everything works as expected now. I'll leave this open until we have added the info to the README :-)

Thanks for the PR, @joergmschulz!