Option to skip files which are already processed by some OCR scanner.

Question

SynIV opened this issue 2 years ago · comments

Sometimes when I scan a document, e.g. on my phone, OCR has already been done there in pretty good quality. On the other hand, when I scan files with my printer or get files from somewhere else that have not been processed by OCR yet, I like the option to automatically scan every file that is newly created on the server.

Therefore it would be absolutely great to automatically skip the OCR scan if the file was already processed and contains printable text.
I would love an option to remove "--redo-ocr" so that these documents are skipped, without activating "--remove-background", because this has some other disadvantages according to the ocrmypdf documentation.

So I would like to ask very nicely if that would be possible. Unfortunately I am not experienced enough to contribute by myself.

Robin Windey · Answer 1 · Mon Apr 25 2022 12:57:34 GMT+0800 (China Standard Time)

Hi @SynIV and thanks for your feature request. First of all: of course this is possible with little effort, but since we try to keep the app as simple as possible, I think we have to discuss the default behaviour a little bit.

According to the docs, there are basically 3 flags for OCRmyPDF to make it skip some pages inside the PDF. The general goal was to support "born digital" documents as well as scanned documents and mixed content. So rethinking this might lead to the conclusion that the default flag should be --skip-text instead of --redo-ocr, so that pages that already contain text (regardless of whether it's a visible or an invisible/OCR text layer) are skipped.

Could you please try to reproduce both of your use cases via the ocrmypdf command directly on the CLI and give us some feedback on whether that fits your needs? So basically something like

ocrmypdf --skip-text input.pdf output.pdf
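
To compare both modes on several files, the call above can be scripted. This is only a minimal sketch, assuming the `ocrmypdf` executable is installed and on the PATH; the helper names `build_command` and `run_ocr` are hypothetical and not part of OCRmyPDF or of this app.

```python
import subprocess
from pathlib import Path

# The two page-handling flags discussed in this thread.
MODES = ("--skip-text", "--redo-ocr")


def build_command(mode: str, src: Path, dst: Path) -> list[str]:
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return ["ocrmypdf", mode, str(src), str(dst)]


def run_ocr(mode: str, src: Path, dst: Path) -> int:
    """Run ocrmypdf and return its exit code; assumes ocrmypdf is installed."""
    return subprocess.run(build_command(mode, src, dst)).returncode
```

Calling `run_ocr("--skip-text", Path("input.pdf"), Path("output.pdf"))` corresponds to the command line shown above; swapping in `"--redo-ocr"` gives the other variant for comparison.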

Kai · Answer 2 · Mon Apr 25 2022 15:08:56 GMT+0800 (China Standard Time)

Hi @R0Wi,

Thank you for your quick answer.

I have tested the --skip-text option on an only half-scanned file and it works great. As described in the documentation, the already scanned pages are skipped.

I think this would be a nice default behavior.

I understand that you try to keep the app as simple as possible, but in my opinion it would make the app more individual and customizable if users could set options to rescan or skip pages with existing printable text.

For me I would be happy with --skip-text as the default behavior 😄

Robin Windey · Answer 3 · Mon Apr 25 2022 15:12:07 GMT+0800 (China Standard Time)

Thanks for your fast feedback. I will discuss this with @bahnwaerter and I think we can deliver a suitable solution in the next few days. We will track our progress here 👍

doppelgrau · Answer 4 · Tue Jun 07 2022 17:25:22 GMT+0800 (China Standard Time)

Out of curiosity (I'd also like to avoid double OCR), is there a decision to change the default?

Robin Windey · Answer 5 · Tue Jun 07 2022 17:42:13 GMT+0800 (China Standard Time)

I think the advantage of using --redo-ocr is that there can also be pages with mixed content. For example, a Word document exported as PDF with some text and an image (containing text) on the same page would be processed without touching the visible text, but the image on the page would be processed and a text layer added just over that image. In that situation --skip-text would just skip the whole page because it notices that there is already text on that page.
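
The difference on mixed-content pages can be illustrated with a toy model. This is only a sketch of the behaviour as described in this thread, not OCRmyPDF's actual implementation; the `Page` class and the `runs_ocr` function are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    has_text: bool   # page already contains text (visible or an OCR layer)
    has_image: bool  # page contains at least one image


def runs_ocr(page: Page, mode: str) -> bool:
    """Toy model: is OCR performed on some part of this page?"""
    if mode == "--skip-text":
        # The whole page is skipped as soon as any text is present.
        return not page.has_text
    if mode == "--redo-ocr":
        # Visible text is left alone, but images are still processed.
        return page.has_image
    raise ValueError(f"unsupported mode: {mode}")


# A Word export with text and an image on the same page.
mixed = Page(has_text=True, has_image=True)
skipped_entirely = not runs_ocr(mixed, "--skip-text")
image_processed = runs_ocr(mixed, "--redo-ocr")
```

In this model the mixed page is skipped entirely with --skip-text, while --redo-ocr still processes its image.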

I think we can go that way:

1. Change the default behaviour to use --skip-text instead of --redo-ocr
2. New feature: add an option to configure the flag to be used inside the config UI. Make it exclusive when using --redo-ocr (disable "remove background" option then, see https://github.com/ocrmypdf/OCRmyPDF/blob/776ada671391a6282cdf397c78a3487fb1607059/src/ocrmypdf/_validation.py#L102)
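
The two proposed changes could be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: the `OcrSettings` class and the `to_cli_args` function are hypothetical names, and the only rule encoded is the one stated above, namely that "remove background" cannot be combined with --redo-ocr.

```python
from dataclasses import dataclass

ALLOWED_MODES = ("--skip-text", "--redo-ocr")


@dataclass
class OcrSettings:
    """Hypothetical settings object backing the config UI."""
    mode: str = "--skip-text"        # proposed new default
    remove_background: bool = False


def to_cli_args(settings: OcrSettings) -> list[str]:
    """Translate settings into ocrmypdf flags, rejecting invalid combinations."""
    if settings.mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {settings.mode}")
    if settings.mode == "--redo-ocr" and settings.remove_background:
        # The linked validation code rejects this combination.
        raise ValueError("--redo-ocr cannot be combined with --remove-background")
    args = [settings.mode]
    if settings.remove_background:
        args.append("--remove-background")
    return args
```

A UI would typically disable the "remove background" checkbox while --redo-ocr is selected instead of raising an error, but the check makes the rule explicit.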

@bahnwaerter any thoughts?

Manuel Bentele · Answer 6 · Wed Jun 15 2022 04:09:04 GMT+0800 (China Standard Time)

Thanks @SynIV for reporting this unfavorable behavior in your desired use case.

As @R0Wi already said, the --skip-text option skips all pages that contain text, even in the case of mixed content (text and images). This behaviour is problematic if OCR has to be performed on images on such mixed-content pages. That is why we originally decided to use the --redo-ocr option as the default.

To cover the use cases described by @SynIV, we have to change the default option from --redo-ocr to --skip-text. Therefore, I agree with the changes proposed by @R0Wi. Please keep in mind, @R0Wi, that this fundamental change should be documented accordingly to prevent further issues and bug reports. From a performance perspective, changing the default option has the benefit of processing PDF files with a lot of mixed content much faster. I think most people will benefit from this effect; otherwise they have to use the new configuration option in the UI.

Kai · Answer 7 · Sat Jun 18 2022 17:37:08 GMT+0800 (China Standard Time)

Thank you so much! 😊

Robin Windey · Answer 8 · Sat Jun 18 2022 18:02:44 GMT+0800 (China Standard Time)

Please let me know if you encounter any errors. Just pushed to the appstore for NC23 and NC24 🚀