shenfei / tassadar_ocr

An OCR service based on tesseract and thrift

tassadar_ocr

Tassadar is an OCR service based on tesseract and thrift.

Usage

API

Tassadar provides the following OCR APIs:

  • get_ocr(1:binary image): accepts binary image data and returns the recognized text.
  • line_ocr(1:binary image): runs OCR line by line.
  • version(): returns the current version.
  • cut_image(1:binary image, 2:i8 cut_type): segments the image into components at one of the following levels:
      0: block
      1: paragraph
      2: textline
      3: word
      4: symbol

The default level is 4.
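Since cut_type is passed as a bare integer, a small enum on the calling side can make the levels easier to read. The following is a minimal sketch, not part of tassadar itself: the names CutType, DEFAULT_CUT_TYPE, and parse_cut_type are hypothetical, and only the numeric values and the default come from the list above.

```python
from enum import IntEnum


class CutType(IntEnum):
    """Segmentation levels for cut_image, numbered as in the API list."""
    BLOCK = 0
    PARAGRAPH = 1
    TEXTLINE = 2
    WORD = 3
    SYMBOL = 4


# The documented default level is 4 (symbol).
DEFAULT_CUT_TYPE = CutType.SYMBOL


def parse_cut_type(value):
    """Convert a level number or a level name (case-insensitive) to a CutType.

    Raises KeyError for an unknown name and ValueError for an unknown number.
    """
    if isinstance(value, str) and not value.strip().isdigit():
        return CutType[value.strip().upper()]
    return CutType(int(value))
```

Because IntEnum members are plain integers, a value such as CutType.WORD could be passed wherever the i8 cut_type is expected. Whether the Python client exposes cut_image with the same signature as the thrift definition is an assumption; this README only shows get_ocr being called from Python.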

Docker

The recommended way to use tassadar is through Docker. You can either pull a pre-built image from Docker Hub (fshen/tassadar_ocr:latest) or build a new one from the Dockerfile.

Quick start:

docker pull fshen/tassadar_ocr:latest
docker tag fshen/tassadar_ocr:latest tassadar

# start the server in the background, then get the OCR result of $WORK_DIR/IMAGE_PATH
docker run -it -d --rm --name ocr -p 9090:9090 -v $WORK_DIR:/app tassadar /root/tassadar_ocr/tassadar_server
docker exec -it ocr python3 -m tassadar_client --input /app/IMAGE_PATH
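Because the server container is started detached, a client call issued immediately afterwards may run before the server is listening. The sketch below is one possible way to wait for the port from Python using only the standard library. The helper name wait_for_port is hypothetical and not part of tassadar; port 9090 is the one used in the quick start.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made before `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumes the quick-start server on localhost:9090):
#     if wait_for_port("localhost", 9090):
#         ...  # safe to create the client now
```

This is a coarse readiness check: an accepted connection does not prove that the OCR engine has finished loading, so a first request may still be slow or fail.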

Build from source

You can also build tassadar from source code. Here are some tips.

First, make sure all the following dependencies are installed:

  • tesseract >= 4.0
  • thrift >= 0.11
  • python3

The default tessdata language in tassadar is chi_sim+eng. To change it, follow the instructions in tessdata.

# server
git clone https://github.com/shenfei/tassadar_ocr.git
cd tassadar_ocr && make
./tassadar_server --port 9090

# client
pip3 install -e tassadar_ocr/python/
python3 -m tassadar_client -h

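After installation, you can start a tassadar server and test OCR from Python:

from tassadar_client import TassadarClient

# Placeholder path: replace with the image you want to recognize.
image_path = 'path/to/image.png'

# Connect to a running tassadar server.
client = TassadarClient(host='localhost', port=9090)

# Read the image as raw bytes and send it to the OCR service.
with open(image_path, 'rb') as fin:
    image = fin.read()
print(client.get_ocr(image))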
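The same call extends naturally to a folder of images. The following is a minimal sketch under the assumption that the client behaves as shown above, that is, get_ocr takes the image bytes and returns text. The helper ocr_directory and its patterns parameter are hypothetical and not part of tassadar_client.

```python
from pathlib import Path


def ocr_directory(client, directory, patterns=("*.png", "*.jpg")):
    """Run client.get_ocr on each matching file and return {filename: text}.

    `client` is any object with a get_ocr(bytes) method, for example a
    TassadarClient connected to a running server.
    """
    results = {}
    for pattern in patterns:
        # Sort so the processing order is stable across runs.
        for path in sorted(Path(directory).glob(pattern)):
            results[path.name] = client.get_ocr(path.read_bytes())
    return results
```

Taking the client as an argument keeps the helper independent of how the connection is created and makes it easy to exercise with a stub object when no server is running.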

Acknowledgment

The original tassadar project was developed from 2014 to 2016 at Uda Inc., a start-up that closed in 2016.

I chose the name tassadar because it is similar to tesseract and because all project names at Uda were taken from StarCraft at the time.

Most of the outdated code has been removed, such as preprocessing via OpenCV and single-character classification via Caffe. I would still like to thank the original contributors.

License: Apache License 2.0


Languages: Python 43.6%, C++ 37.1%, Dockerfile 12.3%, Makefile 5.8%, Thrift 1.2%