Debian Buster: ocrmypdf outdated

Question

joergmschulz opened this issue 4 years ago

This wonderful tool may not be usable on Debian Buster. Buster ships ocrmypdf 8.0.1, which issues warnings like:

WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
When this warning occurs, the PDF is left unchanged and no text layer is added.

Version 9.8.1 from Alpine works perfectly with the same input file.
Nextcloud log:

OCR for file /joerg.schulz/files/FDS Bau - Sanierung Haus Sonnenblick/Bauherr/Dokumentationen/Projektantrag-SoftwareAG2014.pdf not possible. Message: OCRmyPDF exited abnormally with exit-code 0. Message: WARNING - 4: [tesseract] lots of diacritics - possibly poor OCR WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR

joergmschulz commented 4 years ago

see #47

Robin Windey · Answer 1 · Thu Jan 28 2021 15:56:30 GMT+0800 (China Standard Time)

Hi @joergmschulz, it's a shame that you're experiencing issues with OCRmyPDF. First of all, we'd like to mention that we are not willing to replace the tool under the hood. We tried other tools and packages earlier to achieve the same result, which led to even more problems. Generally speaking, we found OCRmyPDF to be the best tool for what this app is meant to do with PDF files.

In your case I see the following options:

Try to install a newer version of OCRmyPDF from outside the regular package source. You could try the Python installation mentioned here for Ubuntu 18.04.
In our app we could just ignore warnings in general. Although that is quite dangerous in my opinion, you could check whether you're able to process the mentioned file "by hand" by invoking ocrmypdf input.pdf output.pdf on the command line. If it just prints some warnings but the output is generated properly, we could think about more fault-tolerant handling inside the app. Please give us some feedback on this, or attach the mentioned PDF file if possible.
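
The "more fault-tolerant handling" idea from the second option could look roughly like the following Python sketch. It is not the app's actual code (the app runs inside Nextcloud); the function names `split_warnings`, `ocr_succeeded` and `run_ocr` are hypothetical, and it assumes that ocrmypdf writes its log messages to stderr and that an exit code of 0 means the output file was produced.

```python
import subprocess
from typing import List, Tuple


def split_warnings(stderr: str) -> Tuple[List[str], List[str]]:
    """Separate WARNING lines from all other non-empty stderr lines."""
    warnings: List[str] = []
    others: List[str] = []
    for line in stderr.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("WARNING"):
            warnings.append(text)
        else:
            others.append(text)
    return warnings, others


def ocr_succeeded(returncode: int, stderr: str) -> bool:
    """Hypothetical policy: only the exit code decides success.

    Warnings on stderr are reported to the caller but do not
    turn a run with exit code 0 into a failure.
    """
    return returncode == 0


def run_ocr(input_pdf: str, output_pdf: str) -> Tuple[bool, List[str]]:
    """Run ocrmypdf on one file and return (success, warnings)."""
    proc = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as str
    )
    warnings, _ = split_warnings(proc.stderr)
    return ocr_succeeded(proc.returncode, proc.stderr), warnings
```

With such a policy, a run that only prints "lots of diacritics" warnings but exits with code 0 would be treated as successful, and the warnings could still be logged for the administrator.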

@bahnwaerter anything to add on this?

By the way, I'm also using Debian Buster with OCRmyPDF 8.0.1 installed and have not seen similar errors. So it might also be related to the PDF files you want to process.

joergmschulz · Answer 2 · Thu Jan 28 2021 17:37:21 GMT+0800 (China Standard Time)

Attaching a file here doesn't work, but it is available at https://cloud.faudin.de/s/e2AbYxXR9njcZRC , password dddjjj

messages:

ocrmypdf --force-ocr /data/nc/joerg.schulz/files/Documents/Pferdezüchter\ Jens.pdf /tmp/ppp.pdf 
   INFO - Optimize ratio: 1.14 savings: 12.1%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ ocrmypdf --version
8.0.1+dfsg
www-data@c:~$ ocrmypdf  /data/nc/joerg.schulz/files/Documents/Projektantrag_SoftwareAG_2014.pdf /tmp/ppp.pdf 
WARNING -    4: [tesseract] lots of diacritics - possibly poor OCR
WARNING -    2: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ php -f /var/www/nc/cron.php
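
As a side note, the version string printed in the session above (`8.0.1+dfsg`) can be compared against a newer release programmatically. The sketch below is only an illustration: the helper names are hypothetical, and 9.8.1 is used as the reference purely because it is the version reported as working in this thread, not because it is a documented minimum.

```python
import re
from typing import Tuple


def parse_version(raw: str) -> Tuple[int, ...]:
    """Extract the leading dotted numeric part, e.g. '8.0.1+dfsg' -> (8, 0, 1)."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", raw)
    if match is None:
        raise ValueError(f"unrecognised version string: {raw!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Assumption: 9.8.1 is only the version reported as working in this thread,
# not an official minimum requirement.
KNOWN_GOOD = (9, 8, 1)


def is_older_than_known_good(raw: str) -> bool:
    """Return True if the given version string is older than KNOWN_GOOD."""
    return parse_version(raw) < KNOWN_GOOD
```

A caller could feed the output of `ocrmypdf --version` into `is_older_than_known_good` and print a hint to upgrade when it returns True.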

joergmschulz · Answer 3 · Thu Jan 28 2021 20:22:15 GMT+0800 (China Standard Time)

Confirming: when I install OCRmyPDF as documented by @R0Wi above, everything works perfectly. Maybe that should go into the README?

Robin Windey · Answer 4 · Thu Jan 28 2021 22:07:15 GMT+0800 (China Standard Time)

Glad to hear that everything works as expected now. I'll leave this open until we have added the info to the README :-)

Robin Windey · Answer 5 · Fri Jan 29 2021 00:17:49 GMT+0800 (China Standard Time)

Thanks for the PR @joergmschulz !