robertknight / tesseract-wasm

JS/WebAssembly build of the Tesseract OCR engine for use in browsers and Node

Home Page: https://robertknight.github.io/tesseract-wasm/

Sometimes getText() returns an empty string

fredkilbourn opened this issue · comments

@robertknight: Starting a new topic based on this #31 (comment) in issue #31:

... as well as one that works but returns an empty string from getText?

Looking into the issue further, it seems that the same image will sometimes work and sometimes return an empty string. I am running through a loop of hundreds of images. Is it possible that I'm not clearing the engine properly between iterations, or that some kind of internal state corruption is happening?

My very generalized pseudocode is this (based on your node CLI example):

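import fs from "node:fs";
import { createOCREngine } from "tesseract-wasm";
import { loadWasmBinary } from "tesseract-wasm/node";

// Create the engine once and load the English model.
const engine = await createOCREngine( { wasmBinary: await loadWasmBinary() } );
engine.loadModel( fs.readFileSync( `${__dirname}/../assets/tessdata_fast-4.1.0/eng.traineddata` ) );

// many_many_images and logger are placeholders defined elsewhere.
for( const image of many_many_images )
{
    engine.loadImage( image );
    const text = engine.getText( logger ); // sometimes "" for an image that otherwise works
}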

The precise problem is that text will sometimes be a blank string, even for an image that works every other time I re-run it.

Am I missing any major step here or could this be a bug?

is it possible that I'm maybe not clearing the engine properly between iterations or some kind of internal state corruption is happening?

It could be that something like this is happening. If you pick just one image of a "typical" size and just repeatedly process it in a loop, do you get the same result, that it eventually fails?

There is an engine.clearImage() method that you can call to free up resources within the library, so it would also be worth testing what happens if you call that at the end of each iteration.
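A minimal sketch of that test, assuming only the three engine methods already mentioned in this thread (loadImage, getText, clearImage). The helper name recognizeAll and the shape of the result objects are made up for illustration and are not part of tesseract-wasm:

```javascript
// Hypothetical helper: `engine` is any object exposing loadImage(),
// getText() and clearImage(), such as the engine from createOCREngine().
function recognizeAll(engine, images) {
  const results = [];
  for (const [index, image] of images.entries()) {
    try {
      engine.loadImage(image);
      const text = engine.getText();
      // Flag blank results so intermittent failures are easy to spot.
      results.push({ index, text, empty: text.trim() === "" });
    } finally {
      // Release the image's resources even if loading or recognition threw.
      engine.clearImage();
    }
  }
  return results;
}
```

Passing the same image many times (for example an array filled with one image) and counting the entries with empty set to true would show whether the failure depends on the number of iterations rather than on the image.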

Robert, I finally figured it out, and I'm sorry to have sent you on this goose chase. I see now that errors raised inside wasm are not thrown as catchable JavaScript errors, but are instead logged to stderr (which in my environment wasn't going anywhere I could see). After investigating stderr I now see: Error in pixCreateNoInit: pixdata_malloc fail for data. I'll have to hook into that and respond to errors accordingly for now.

We can close this whole issue out.
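One possible way to hook into stderr in Node is sketched below. It is not a tesseract-wasm API: captureStderr and getTextOrThrow are hypothetical helpers, and the approach assumes the wasm build's error output reaches process.stderr.write in the same thread that calls getText(). If the engine runs in a worker, or the runtime writes to file descriptor 2 directly, this will not see the messages.

```javascript
// Hypothetical helper: run fn() while collecting everything written to
// process.stderr, then restore the original write function.
function captureStderr(fn) {
  const chunks = [];
  const originalWrite = process.stderr.write;
  process.stderr.write = (chunk, encoding, callback) => {
    chunks.push(typeof chunk === "string" ? chunk : Buffer.from(chunk).toString("utf8"));
    // Honor the optional completion callback of stream.write().
    const done = typeof encoding === "function" ? encoding : callback;
    if (typeof done === "function") done();
    return true;
  };
  try {
    const result = fn();
    return { result, stderr: chunks.join("") };
  } finally {
    process.stderr.write = originalWrite;
  }
}

// Hypothetical wrapper: turn "empty text plus an 'Error in ...' line on
// stderr" into a catchable exception.
function getTextOrThrow(engine) {
  const { result, stderr } = captureStderr(() => engine.getText());
  if (result === "" && /Error in /.test(stderr)) {
    throw new Error(`OCR failed: ${stderr.trim()}`);
  }
  return result;
}
```

With this in place, the loop can wrap getTextOrThrow(engine) in try/catch and retry or skip the image. Note that captured output is swallowed rather than forwarded, so anything unrelated written to stderr during the call is only available in the returned string.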