In the latest Tesseract NuGet package (4.1.1), we are facing a performance issue when getting hOCR text from a page.

Question


pjoshi90 opened this issue 2 years ago

We have a 20-page TIFF file. When we run OCR to get the hOCR text for a page, it takes around 4 seconds per page (roughly 80 seconds for the whole file), which hurts the overall performance of OCR on multi-page files.

Kees · Answer 1 · Sat Apr 23 2022 21:16:29 GMT+0800 (China Standard Time)

You could try https://github.com/Sicos1977/TesseractOCR, which is updated to the latest Tesseract version. I don't know if it makes any difference, though. You will probably need to rewrite some code (I expect not much) because I changed some things.

pjoshi90 · Answer 2 · Mon Apr 25 2022 22:46:30 GMT+0800 (China Standard Time)

Thanks @Sicos19, it works as expected.
