UNITS

This is a official implementation of the following paper

Towards Unified Scene Text Spotting based on Sequence Generation, CVPR 2023.
Taeho Kil, Seonghyeon Kim, Sukmin Seo, Yoonsik Kim, Daehee Kim

Introduction

Sequence generation models have recently made significant progress in unifying various vision tasks. Although some auto-regressive models have demonstrated promising results in end-to-end text spotting, they use specific detection formats while ignoring various text shapes and are limited in the maximum number of text instances that can be detected. To overcome these limitations, we propose a UNIfied scene Text Spotter, called UNITS. Our model unifies various detection formats, including quadrilaterals and polygons, allowing it to detect text in arbitrary shapes. Additionally, we apply starting-point prompting to enable the model to extract texts from an arbitrary starting point, thereby extracting more texts beyond the number of instances it was trained on. Experimental results demonstrate that our method achieves competitive performance compared to state-of-the-art methods. Further analysis shows that UNITS can extract a larger number of texts than it was trained on.

Updates

2023-07-17 First Commit, We release our code, data annotations, and model weights.

Getting Started

Installation

Our model implementation is based on timm.

To begin, please install the necessary dependencies by running the following command:

pip install -r requirements.txt

Preparing Datasets

LMDB

Before proceeding, make sure to download the following datasets according to the instructions provided by AdelaiDet: README.md. The datasets you need to download are TotalText, CTW1500, ICDAR2013, ICDAR2015, and CurvedSynText150k.

Additionally, download the MLT2019, TextOCR, and HierText datasets from the following sources: MLT2019, TextOCR, HierText.

Furthermore, ensure that all dataset annotations are converted to the lmdb format. We have provided the source code to convert the annotations from their original format to the lmdb format used in UNITS. Execute the following commands:

python script make_lmdb.py --split train
python script make_lmdb.py --split val
python script make_lmdb.py --split test

Additionally, you can download the converted lmdb files from the following path: lmdb.

Structure

This repository assumes the following structure for the datasets:

train_datasets
├── images
│   ├── hiertext
│   ├── textocr
│   ├── synthtext150k
│   ├── ctw1500
│   ├── totaltext
│   ├── icdar13
│   ├── icdar15
│   ├── mlt19
│
│
├── hiertext_train.lmdb
├── textocr_train.lmdb
├── synthtext150k.poly.part1_train.lmdb
├── ctw1500.poly_test.lmdb
              .
              .

Each lmdb file contains OCR annotation information and relative paths to the corresponding image filenames.
All images are stored in the "images" directory, with different directories depending on the dataset name.
For more detailed information about this structure, refer to the script/make_lmdb.py file. This source code provides the necessary functionality to modify the names or structure of the relative paths as needed.

Demo

You can visualize the network's predictions by running the following command:

PYTHONPATH=$PWD python script/demo.py --conf configs/finetune.py \
                                      --ckpt weights/shared.pt

You can also adjust the DETECT_TYPE parameter to modify the detection format for predictions.

Training

Checkpoints

pretrain
shared
ic15
totaltext
ctw
textocr

Examples

Here are the pre-training and fine-tuning configurations used in our experiments. The following commands demonstrate the experiment setup for 8 GPUs with A100 (80GB) memory.

# pre-train
PYTHONPATH=$PWD python script/train.py --conf configs/pretrain.py \
                                       --n_proc 8

# unified finetune
PYTHONPATH=$PWD python script/train.py --conf configs/finetune.py \
                                       --ckpt weights/pretrain.pt \
                                       --n_proc 8

# ic15 finetune
PYTHONPATH=$PWD python script/train.py --conf configs/ic15.py \
                                       --ckpt weights/shared.pt \
                                       --n_proc 8

In addition to ICDAR2015, fine-tuning config files for TotalText, CTW1500, and TextOCR dataset have also been uploaded.

Evaluation

Our implementation is based on AdelaiDet.

To evaluate on TotalText, CTW1500, ICDAR2015, TextOCR first download the gt.tar and vocab.tar

cd evaluation/datasets
mkdir evaluation
tar -xvf gt.tar
tar -xvf vocab.tar

The evaluation process consists of three steps:

Saving the prediction results as TXT files for each dataset.

Here is an example of extracting results for IC15:

PYTHONPATH=$PWD python script/test.py --conf configs/ic15.py \
                                      --ckpt weights/shared.pt

Combining the TXT files into a single JSON file.