robertknight / tesseract-wasm

JS/WebAssembly build of the Tesseract OCR engine for use in browsers and Node

Home Page: https://robertknight.github.io/tesseract-wasm/

Large images can cause WASM out-of-memory errors

robertknight opened this issue

Large images can cause a memory allocation failure when loaded into Tesseract. The image size threshold for triggering this is lower if a large image is already loaded into Tesseract.

Images taken on my iPhone X+ are 3024x4032 at their native resolution. With the current 128MB memory cap they will load into the WebAssembly memory when first dropped into the demo app, but dropping another image of a similar size a second time will trigger an error.

Some things that can be done (a rough application-side size check is sketched after this list):

  • Raise the WASM memory cap from 128MB to a higher value
  • Convert the image to 8-bit greyscale before loading into Tesseract
  • Resize images to a certain max size before loading into Tesseract
  • Specify a maximum image size in the library
  • Improve error handling for out-of-memory situations so that a useful error is at least reported
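In the meantime, applications can guard against this themselves. Here is a minimal sketch of such a check, assuming the 128MB cap mentioned above and an arbitrary budget of roughly a third of it per 32-bit image copy; none of this is part of the library:

const WASM_MEMORY_CAP = 128 * 1024 * 1024; // current cap discussed above
const PER_IMAGE_BUDGET = WASM_MEMORY_CAP / 3; // arbitrary safety margin, not a library rule

function assertImageFits(width: number, height: number): void {
  const bytesPerCopy = width * height * 4; // one 32-bit RGBA copy in WASM memory
  if (bytesPerCopy > PER_IMAGE_BUDGET) {
    throw new Error(
      `A ${width}x${height} image needs ~${Math.round(bytesPerCopy / 2 ** 20)}MB per copy; ` +
        `downscale it before OCR to avoid a WASM out-of-memory error`,
    );
  }
}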

Images taken on my iPhone X+ are 3024x4032 at their native resolution. With the current 128MB memory cap they will load into the WebAssembly memory when first dropped into the demo app, but dropping another image of a similar size a second time will trigger an error.

The improvements in #25 and #32 helped by reducing the number of image copies held in WASM memory, and thus peak memory usage. I can no longer reproduce the issue using images of this size.

Even larger images could potentially still trigger the issue though.

Convert the image to 8-bit greyscale before loading into Tesseract

I tried this. For a 3024x4032 image (the default from a current iPhone), allocating an 8-bit greyscale image with Leptonica and copying the data from the ImageData into the Leptonica image in JS reduced peak WASM memory usage by (very roughly) 50%. The amount of data per image copy drops from 47MB (for a 32-bit image) to 12MB. Some implementation notes (a rough sketch of the copy follows the list):

  1. The byte order (in each 32-bit word) needs to be swapped when copying from ImageData => Leptonica image
  2. Leptonica image lines are padded to a multiple of 4 bytes, whereas ImageData does not use padding
  3. Leptonica supports 1-bit images, so memory usage could be reduced even further if the input were binarized before being passed to Tesseract
  4. Tesseract retains a copy of the input image and, once the image is binarized, a copy of the binarized image as well
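For reference, here is a rough sketch of the ImageData => 8-bit Leptonica copy described in notes 1 and 2. The Leptonica allocation itself is omitted and the luma weights are an arbitrary choice; the byte-swap and row-padding layout is the part that matters:

function imageDataToGrey8(image: ImageData): Uint8Array {
  const { width, height, data } = image;
  // Note 2: Leptonica pads each 8bpp row out to a whole number of 32-bit words.
  const stride = Math.ceil(width / 4) * 4;
  const out = new Uint8Array(stride * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const src = (y * width + x) * 4; // RGBA, no row padding in ImageData
      const grey =
        0.299 * data[src] + 0.587 * data[src + 1] + 0.114 * data[src + 2];
      // Note 1: swap the byte order within each 32-bit word, so pixel x lands
      // at byte (x ^ 3) of its word when the buffer is read as 32-bit words.
      out[y * stride + ((x & ~3) | (3 - (x & 3)))] = grey;
    }
  }
  return out;
}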

@robertknight I have 5072 x 8416 images and they are hitting a new exception tied to this:

RangeError: offset is out of bounds
    at Uint32Array.set (<anonymous>)
    at OCREngine.loadImage (file:///C:/UserLocal/git/Budros_Scraper/node_modules/tesseract-wasm/dist/lib.js:662:24)

https://github.com/robertknight/tesseract-wasm/blob/b73f68f/src/ocr-engine.ts#L169

A 32-bit image of that size requires ~162MB of space and this library is currently configured to allow a maximum memory size of 128MB.

Some workarounds are to use a canvas to capture a resized or partial view of the image as an ImageBitmap and pass the result to this library. See the imageDataFromBitmap(bitmap: ImageBitmap): ImageData function for code that can be copied and adapted.
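For example, here is a browser-oriented sketch of the resize workaround. The 2048px cap is an arbitrary example value, and in Node you would need a canvas implementation to do the equivalent:

async function downscaleForOcr(blob: Blob, maxDim = 2048): Promise<ImageData> {
  const bitmap = await createImageBitmap(blob);
  const scale = Math.min(1, maxDim / Math.max(bitmap.width, bitmap.height));
  const width = Math.round(bitmap.width * scale);
  const height = Math.round(bitmap.height * scale);

  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const context = canvas.getContext("2d");
  if (!context) {
    throw new Error("Could not create a 2D canvas context");
  }
  context.drawImage(bitmap, 0, 0, width, height);
  // The downscaled ImageData stays within the WASM memory budget and can then
  // be passed to OCREngine.loadImage.
  return context.getImageData(0, 0, width, height);
}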

I should note that internally Tesseract resizes detected lines of text to a fixed height before performing recognition on them, so once the input image is at a high enough resolution that the text can easily be "read", there isn't any benefit to using a higher resolution, AFAIK.

I appreciate the info. I switched to this library from tesseract.js for the SIMD speed improvements, but that library never errored out on any size of image, which was nice. I've resorted to manually sizing down images that are too large, but my project involves automated scraping of information, so I don't have full control over the original size of the image.

I don't have all the answers on how you might make your library more flexible in the long term, but here are a couple of tidbits:

One other concern I ran into: in some cases the image is just large enough not to throw an error, but .getText() returns an empty string instead of producing any kind of error. I'm okay with handling a thrown error, but in this case I have to guess at an arbitrary pixel limit not to exceed. I don't know whether this is a code issue or one with my environment, but it's more difficult to address.

Are you able to provide an example of one of the larger images that causes this error, as well as one that works but returns an empty string from getText?

It is definitely possible to raise the memory limit. I will check what the downsides of doing so are, and as long as they won't cause a major problem I can raise it.

Per #69 (comment), I see that my entire problem here is that I wasn't watching stderr. Please disregard; my only feedback would be that it would be good if an error could be thrown instead of only logging to stderr, or at least to make it clearer that this is the behavior. Additionally, raising the memory limit as discussed would help in general.

Thanks and sorry for wasting your time with the blank return rabbit-hole.

my only feedback would be that it would be good if an error could be thrown instead of only logging to stderr.

I agree. Unfortunately the underlying Tesseract library is not always good about returning errors to its caller, which makes reliably reporting errors difficult in some cases. In this case the warning is coming from somewhere in its C++ code.

One other thing I'll mention related to memory usage: the OCREngine class has a destroy() method which you can call to explicitly release the entire engine instance. This frees up more resources than clearImage, but you cannot use the OCREngine instance after it is called and you have to create a new one for any subsequent images.
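As a rough sketch of that pattern, assuming the engine is created with createOCREngine and a traineddata buffer is loaded with loadModel (adjust to your actual setup):

import { createOCREngine } from "tesseract-wasm";

async function ocrLargeImages(images: ImageData[], model: Uint8Array): Promise<string[]> {
  const results: string[] = [];
  for (const image of images) {
    // A fresh engine per image keeps peak WASM memory low, at the cost of
    // re-initialising Tesseract each time. Setup calls here are assumptions;
    // loadImage, getText and destroy are the methods discussed above.
    const engine = await createOCREngine();
    engine.loadModel(model);
    try {
      engine.loadImage(image);
      results.push(engine.getText());
    } finally {
      // Frees more than clearImage(); the engine cannot be used afterwards.
      engine.destroy();
    }
  }
  return results;
}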

No, only the two changes mentioned in the changelog.

Thanks!