nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API

Add Support for OSD to the ITesseract interface via a new Recognize method option OR provide a new Interface that allows for OSD Recognition

swamyrajamohan opened this issue · comments

OSD (orientation and script detection) provides a way to support automatic language detection in Tesseract (Tess4J).

Currently, Tesseract requires the language to be provided at extraction time. Often the language of the text in an image is not known in advance. In that case, OSD can detect the script the language is written in and the orientation of the image itself. Coupling this with a simple extraction using the script yields a coarse-grained extraction that is sufficient for language detection on that page (using an open source library like lingua), and the subsequent full processing can then be done with doOCR using the detected language. While this seems like an involved approach, OSD plus coarse-grained extraction using the script (e.g. script/Latin, script/Devanagari) takes a fraction of the time required for a full extraction using the actual language.
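To make the flow above concrete, here is a minimal sketch of the two-pass idea. It is library-agnostic on purpose: `recognize`, `detectScript`, `ocr`, and `detectLanguage` are hypothetical names, not Tess4J or lingua APIs, and each step is passed in as a function so that real calls can be plugged in later.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Sketch only: none of these names exist in Tess4J or lingua. */
final class OsdPipelineSketch {

    /**
     * Two-pass OCR for an image whose language is unknown.
     *
     * @param detectScript   hypothetical OSD step, returns a script name such as "Latin"
     * @param ocr            hypothetical OCR step taking (image, language or script model)
     * @param detectLanguage hypothetical language identifier; it must return a
     *                       language code that the OCR step understands
     */
    static <I> String recognize(I image,
                                Function<I, String> detectScript,
                                BiFunction<I, String, String> ocr,
                                Function<String, String> detectLanguage) {
        // Pass 1: cheap script detection, then a coarse extraction with the script model.
        String script = detectScript.apply(image);
        String coarseText = ocr.apply(image, "script/" + script);

        // Identify the language from the coarse text.
        String language = detectLanguage.apply(coarseText);

        // Pass 2: full extraction with the detected language.
        return ocr.apply(image, language);
    }
}
```

With Tess4J, the `ocr` function would presumably wrap setting the language and calling doOCR, but that wiring is an assumption and is left out here.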

Hence I am requesting a good high-level wrapper for OSD recognition as well, similar to the doOCR method of ITesseract.

Thanks
Swami

Currently there is a way to do OSD with Tess4J, but it is quite involved and requires working with the JNA wrapper directly, as outlined in this unit test:
https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPITest.java#L474

Would TessBaseAPIDetectOrientationScript suit you better? It's simpler to call.

https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPITest.java#L440

As I had indicated, the approaches at line 474 (https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPITest.java#L474) and line 440 (https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPITest.java#L440) are both involved and require working with JNA directly. It would be great to add support similar to doOCR in the ITesseract interface or, better, to include another interface that extends ITesseract and provides a recognize method that abstracts away the quirks of the native binding. It would also mean that image reading is abstracted away. PyTesseract provides such an abstraction for OSD, with the OSD output in a nice object.
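As a rough illustration of what a "nice object" for OSD output could look like, here is a small sketch. It assumes the line-oriented `Key: value` text that the Tesseract command line prints in OSD-only mode (the text PyTesseract's OSD helper builds on). The class name `OsdResultSketch` and its fields are hypothetical and are not part of Tess4J.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical value object for an OSD result; not a Tess4J class. */
final class OsdResultSketch {
    final int orientationDegrees;
    final double orientationConfidence;
    final String script;
    final double scriptConfidence;

    OsdResultSketch(int orientationDegrees, double orientationConfidence,
                    String script, double scriptConfidence) {
        this.orientationDegrees = orientationDegrees;
        this.orientationConfidence = orientationConfidence;
        this.script = script;
        this.scriptConfidence = scriptConfidence;
    }

    /**
     * Parses "Key: value" lines such as "Orientation in degrees: 270".
     * The key names are assumed from Tesseract's OSD text output; a missing
     * numeric key makes the number parsing throw.
     */
    static OsdResultSketch parse(String osdText) {
        Map<String, String> fields = new HashMap<>();
        for (String line : osdText.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return new OsdResultSketch(
                Integer.parseInt(fields.get("Orientation in degrees")),
                Double.parseDouble(fields.get("Orientation confidence")),
                fields.get("Script"),
                Double.parseDouble(fields.get("Script confidence")));
    }
}
```

A high-level method on the interface could return such an object directly, so callers never touch native buffers or pointers.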

Thanks for responding so quickly, @nguyenq.

Implemented in commit d5621e5.

@swamyrajamohan Please check out the latest and help test.