Debian Buster: ocrmypdf outdated
joergmschulz opened this issue
This wonderful tool may not be usable on Debian Buster. Buster ships ocrmypdf 8.0.1, which issues warnings like
WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
When this warning appears, the PDF is left as it is and no text layer is added.
Version 9.8.1 from Alpine works perfectly with the same input file.
Nextcloud log:
OCR for file /joerg.schulz/files/FDS Bau - Sanierung Haus Sonnenblick/Bauherr/Dokumentationen/Projektantrag-SoftwareAG2014.pdf not possible. Message: OCRmyPDF exited abnormally with exit-code 0. Message: WARNING - 4: [tesseract] lots of diacritics - possibly poor OCR WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
Hi @joergmschulz, it's a shame that you're experiencing issues with OCRmyPDF. First of all, we'd like to mention that we are not willing to replace the tool under the hood. We tried other tools and packages earlier to achieve the same result, which led to even more problems. Generally speaking, we found that OCRmyPDF is the best tool to achieve exactly what we want to do with this app in combination with PDF files.
In your case I see the following options:
- Try to install a newer version of OCRmyPDF from outside the regular package source. You could try the Python installation mentioned here for Ubuntu 18.04.
- In our app we could just ignore warnings in general, even though that is quite dangerous in my opinion. You could check whether you're able to process the mentioned file "by hand" by invoking
ocrmypdf input.pdf output.pdf
on the command line. If it just prints some warnings but the output is generated properly, we could think about more fault-tolerant handling inside the app.
Please give us some feedback on this, or attach the mentioned PDF file if possible.
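To make the second option a bit more concrete, here is a rough Python sketch of what "fault-tolerant" handling could mean: decide success by the exit code alone and collect warnings separately instead of treating any output as a failure. This is not the app's actual code. The names `OcrResult`, `classify` and `run_ocr` are made up for illustration, and the sketch assumes that OCRmyPDF writes its log to stderr with lines starting with the level name (`WARNING`, `ERROR`), as in the messages quoted in this issue.

```python
import subprocess
from typing import List, NamedTuple


class OcrResult(NamedTuple):
    """Hypothetical result container, not part of OCRmyPDF or the app."""
    ok: bool
    warnings: List[str]
    errors: List[str]


def classify(exit_code: int, stderr: str) -> OcrResult:
    """Treat exit code 0 as success, even if warnings were logged."""
    lines = [ln.strip() for ln in stderr.splitlines() if ln.strip()]
    # Assumption: log lines start with the level name, e.g. "WARNING - 2: ..."
    warnings = [ln for ln in lines if ln.startswith("WARNING")]
    errors = [ln for ln in lines if ln.startswith("ERROR")]
    return OcrResult(ok=(exit_code == 0), warnings=warnings, errors=errors)


def run_ocr(input_pdf: str, output_pdf: str) -> OcrResult:
    """Run `ocrmypdf input output` and classify the outcome."""
    proc = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],
        capture_output=True,
        text=True,
    )
    return classify(proc.returncode, proc.stderr)
```

With this approach the warnings could still be written to the log for the admin, while the generated output file is accepted as long as the process exits with code 0.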
@bahnwaerter anything to add on this?
Btw.: I'm also using Debian Buster with OCRmyPDF 8.0.1 installed and I did not see similar errors. So it might also be related to the PDF files you want to process.
Attaching a file here doesn't work, but it is available at: https://cloud.faudin.de/s/e2AbYxXR9njcZRC , password dddjjj
messages:
ocrmypdf --force-ocr /data/nc/joerg.schulz/files/Documents/Pferdezüchter\ Jens.pdf /tmp/ppp.pdf
INFO - Optimize ratio: 1.14 savings: 12.1%
INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ ocrmypdf --version
8.0.1+dfsg
www-data@c:~$ ocrmypdf /data/nc/joerg.schulz/files/Documents/Projektantrag_SoftwareAG_2014.pdf /tmp/ppp.pdf
WARNING - 4: [tesseract] lots of diacritics - possibly poor OCR
WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
INFO - Optimize ratio: 1.00 savings: 0.0%
INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ php -f /var/www/nc/cron.php
Confirming: when I install the OCRmyPDF version as documented by @R0Wi above, everything works perfectly. Maybe that should go into the README?
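Alongside a README note, the installed version could be checked up front. The following is only a sketch under assumptions: `parse_version`, `is_recent_enough` and `MIN_VERSION` are hypothetical names, and the threshold 9.8.1 is simply the version reported as working in this thread, not a documented minimum. It parses the output of `ocrmypdf --version`, including Debian-style suffixes such as `8.0.1+dfsg`.

```python
import re
from typing import Tuple

# Assumption: 9.8.1 is the version reported as working in this thread;
# it is not an officially documented minimum.
MIN_VERSION = (9, 8, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric version, ignoring suffixes like '+dfsg'."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    # Pad to three components so that "9" compares like "9.0.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_recent_enough(version_output: str,
                     minimum: Tuple[int, ...] = MIN_VERSION) -> bool:
    """Compare the parsed version against the assumed minimum."""
    return parse_version(version_output) >= minimum
```

For example, `is_recent_enough("8.0.1+dfsg")` is false, while `is_recent_enough("9.8.1")` is true, so the app could log a hint pointing to the README instead of failing silently.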
Glad to hear that everything works as expected now. I'll leave this open until we have added the info to the README :-)
see #47
Thanks for the PR, @joergmschulz!