Tesseract working out only the first five pages of documents

Question

Tesseract working out only the first five pages of documents

santangelo opened this issue 5 years ago · comments

Hi,
first of all thank you for this great work you did.

I have set up Ambar to crawl a number of pdfs in an Archive folder with subfolders containing pdfs.
It seemed to me that it was doing great, but I have just noticed in the crawler log that Tesseract is processing only the first five pages of the documents (attached screenshot).

Of course the words from the 6th page on are not in the index.
Is there an option I did not know of?
Thanks a lot.
Michele

Michele · Answer 1 · Thu Aug 22 2019 21:12:28 GMT+0800 (China Standard Time)

In case it helps other, it was enough to set the env var "ocrPdfMaxPageCount" to desired number of pages on pipeline container. Thanks to the Ambar help.
Michele