Image text redaction

Evaluation of models for image text detection, recognition, and redaction. The evaluated models can detect horizontal printed and handwritten text. Detection of curved text is beyond the scope of these models, although angled text in some orientations can be detected by one or more of the models. The redaction module is a placeholder that redacts any text that is recognized. It needs to be updated to redact text selectively, based on the use case.
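As a rough illustration of what selective redaction could look like, the sketch below filters recognized text against a set of patterns and blanks out only the matching boxes. It is not the repository's redaction module: the `(x_min, y_min, x_max, y_max)` box layout, the function names, and the example patterns are all assumptions made for this example, and the image is a plain nested list so that the snippet has no dependencies.

```python
import re
from typing import List, Pattern, Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]

# Hypothetical patterns; a real use case would supply its own.
SENSITIVE_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b"),        # dates such as 01/02/2020
    re.compile(r"\bMRN\s*:?\s*\d+\b", re.IGNORECASE),  # record numbers
]


def select_boxes_to_redact(
    detections: Sequence[Tuple[Box, str]],
    patterns: Sequence[Pattern[str]] = SENSITIVE_PATTERNS,
) -> List[Box]:
    """Return only the boxes whose recognized text matches a pattern."""
    return [
        box
        for box, text in detections
        if any(p.search(text) for p in patterns)
    ]


def redact(image: List[List[int]], boxes: Sequence[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of a grayscale image with the given boxes filled in."""
    out = [row[:] for row in image]
    height = len(out)
    width = len(out[0]) if out else 0
    for x0, y0, x1, y1 in boxes:
        # Clamp each box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                out[y][x] = fill
    return out


if __name__ == "__main__":
    detections = [((0, 0, 2, 1), "MRN: 12345"), ((0, 2, 3, 3), "LEFT LATERAL")]
    image = [[255] * 4 for _ in range(3)]
    print(redact(image, select_boxes_to_redact(detections)))
```

Keeping the selection step separate from the pixel fill means the matching rules can be swapped per use case without touching the drawing code.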

Three models evaluated

EasyOCR
TrOCR (also available on Hugging Face)
PaddleOCR

For TrOCR, both the handwritten and the printed text models were tested.

Sample X-ray image from the Google Healthcare API page

Notes

The bounding boxes (bboxes) output by the OCR models are currently coalesced to form bboxes around phrases (SnapToLineGrid.py). This is done for both text region detectors, EasyOCR and PaddleOCR. Additional bbox coalescing around regions still needs to be done to assist de-identification; this will be optional information. Line-based coalescing is required for models that only do recognition (TrOCR), since they cannot accept multiline regions.
Try other fully transformer-based models that are emerging.
TrOCR recognition is quite slow compared to PaddleOCR and EasyOCR. The latter two take, on average, tens of milliseconds for an entire image, whereas TrOCR takes on the order of seconds for each text region within an image. Given this caveat, CPUs may suffice for pure evaluation.
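The line-based coalescing mentioned in the notes can be sketched roughly as follows. This is not the logic in SnapToLineGrid.py, which is not reproduced here. It is a minimal greedy merge, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form; the function names and the `max_gap` and `min_overlap` thresholds are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def same_line(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    """True if two boxes overlap vertically by at least min_overlap
    of the shorter box's height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shorter > 0 and overlap / shorter >= min_overlap


def coalesce_into_lines(
    boxes: Sequence[Box], max_gap: int = 20, min_overlap: float = 0.5
) -> List[Box]:
    """Greedily merge word boxes into single-line phrase boxes.

    Boxes are visited left to right; each one joins the first existing
    line it vertically overlaps and sits within max_gap pixels of,
    otherwise it starts a new line.
    """
    lines: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for i, line in enumerate(lines):
            if same_line(line, box, min_overlap) and box[0] - line[2] <= max_gap:
                # Grow the line box to enclose the new word box.
                lines[i] = (
                    min(line[0], box[0]),
                    min(line[1], box[1]),
                    max(line[2], box[2]),
                    max(line[3], box[3]),
                )
                break
        else:
            lines.append(box)
    # Return in reading order: top to bottom, then left to right.
    return sorted(lines, key=lambda b: (b[1], b[0]))


if __name__ == "__main__":
    words = [(10, 10, 50, 30), (55, 12, 120, 31), (10, 60, 80, 80), (300, 11, 340, 29)]
    print(coalesce_into_lines(words))
```

Because every merged box spans a single text line, each one could be cropped and passed to a recognition-only model one at a time.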

License

This repository is covered by the MIT license.
