nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API

Memory leak when using tess4j for parallel processing in docker environment

milen-dimitrov opened this issue

I encountered this memory leak a few weeks ago and have established that it only occurs when doing parallel OCR processing with tess4j inside a Docker container.

While the container is running, the Java heap and native memory remain stable, but the container's RAM usage keeps increasing.
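One way to watch for this symptom from inside the JVM is to log heap usage next to the process's resident set size. The sketch below is only an illustration: it assumes a Linux container where /proc/self/status is readable, and the class and method names (MemorySnapshot, heapUsedKb, rssKb) are hypothetical and not taken from the sample project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: compares JVM heap usage with the process RSS.
class MemorySnapshot {

    // Heap currently in use by the JVM, in kB.
    static long heapUsedKb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024;
    }

    // Extracts the VmRSS value (kB) from the lines of /proc/<pid>/status.
    // Returns -1 if no VmRSS line is present.
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    // Resident set size of this process in kB, or -1 when /proc is unavailable
    // (for example on a non-Linux host).
    static long rssKb() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1;
        }
        try {
            return parseVmRssKb(Files.readAllLines(status));
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("heap used (kB): " + heapUsedKb());
        System.out.println("process RSS (kB): " + rssKb());
    }
}
```

If the heap figure stays flat across many PDFs while the RSS figure keeps climbing, the growth is happening outside the Java heap, which matches the behaviour described here.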

To reproduce the leak, I iterate over PDF files and create a 4-thread pool for each PDF file:
ExecutorService executor = Executors.newFixedThreadPool(4);

Each of the 4 threads processes one page at a time.
For each page, a new Tesseract() instance is created and tesseract.doOCR(pageImage) is called to do the OCR.
When processing of the PDF file finishes, I close the thread pool using executor.shutdownNow().
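The pattern described above can be sketched roughly as follows. This is not the code from the sample repository: pages are represented as plain strings, and the `ocr` function is a hypothetical stand-in for creating a new Tesseract() and calling tesseract.doOCR(pageImage), so that the sketch runs without tess4j on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of the "new pool per PDF" pattern that triggers the reported leak.
class PerPdfPoolRepro {

    // Processes all pages of one PDF on a fresh 4-thread pool, then shuts it down.
    // `ocr` is a hypothetical stand-in for: new Tesseract().doOCR(pageImage).
    static List<String> processPdf(List<String> pages, Function<String, String> ocr)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> ocr.apply(page);
                futures.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // wait for each page, keeping page order
            }
            return results;
        } finally {
            executor.shutdownNow(); // the pool and its threads are discarded per PDF
        }
    }

    public static void main(String[] args) throws Exception {
        // Three fake "PDFs"; each iteration creates and destroys its own pool.
        for (int pdf = 1; pdf <= 3; pdf++) {
            List<String> pages = Arrays.asList("pdf" + pdf + "-page1", "pdf" + pdf + "-page2");
            System.out.println(processPdf(pages, page -> "text of " + page));
        }
    }
}
```

The relevant detail is that worker threads are created and destroyed once per PDF, so any native state tied to a thread's lifetime is set up and torn down repeatedly.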

I've managed to circumvent the leak by making my thread pool static and never shutting down its threads; I only reuse them.
This avoids the ever-increasing RAM usage, but I don't think recreating the thread pool and then shutting it down should be a problem.
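A minimal sketch of that workaround, under the same assumptions as before: pages are plain strings and `ocr` is a hypothetical stand-in for the tess4j call. The class name and the use of daemon threads are choices made for this illustration, not details from the original report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of the workaround: one static pool reused for every PDF.
class SharedPoolOcr {

    // Created once and never shut down. Daemon threads (an assumption of this
    // sketch) let the JVM exit even though the pool stays alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, runnable -> {
        Thread thread = new Thread(runnable, "ocr-worker");
        thread.setDaemon(true);
        return thread;
    });

    // Processes all pages of one PDF on the shared pool.
    // `ocr` is a hypothetical stand-in for: new Tesseract().doOCR(pageImage).
    static List<String> processPdf(List<String> pages, Function<String, String> ocr)
            throws Exception {
        List<Callable<String>> tasks = new ArrayList<>();
        for (String page : pages) {
            tasks.add(() -> ocr.apply(page));
        }
        List<String> results = new ArrayList<>();
        // invokeAll blocks until all tasks finish and returns futures in task order.
        for (Future<String> future : POOL.invokeAll(tasks)) {
            results.add(future.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // The same four worker threads serve every PDF.
        for (int pdf = 1; pdf <= 3; pdf++) {
            List<String> pages = Arrays.asList("pdf" + pdf + "-page1", "pdf" + pdf + "-page2");
            System.out.println(processPdf(pages, page -> "text of " + page));
        }
    }
}
```

Compared with the per-PDF pool, the worker threads here live for the whole run, which is the one difference the workaround relies on.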

If I run my code outside of the docker container, there is no memory leakage.
If I run my code in the container but using only one thread there is no memory leak either.

I made a Git repository with a sample Java project to illustrate and reproduce the leak. Just build and run the Docker image:
https://github.com/milen-dimitrov/TessMemoryLeakSample

There are also these messages, which may mean something. I get them when I interrupt my program.
https://github.com/milen-dimitrov/TessMemoryLeakSample/blob/main/Screenshot_20230408_200926.png?raw=true

The memory leak issue has been reported several times, but we have no way to address it. The Java binding is just a thin Java layer over the Tesseract C API. The native code seems prone to memory leaks in multithreaded applications.