CLIP4STR

This is a dedicated re-implementation of CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.


Introduction

This is a third-party implementation of the paper CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.

Figure: The framework of CLIP4STR. It has a visual branch and a cross-modal branch. The cross-modal branch refines the prediction of the visual branch for the final output. The text encoder is partially frozen.

CLIP4STR aims to build a scene text recognizer on top of a pre-trained vision-language model. In this re-implementation, we try to reproduce the performance of the original paper and evaluate the effectiveness of pre-trained VL models in the STR area.
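To make the dataflow concrete, here is a minimal, hypothetical sketch of the two-branch scheme in PyTorch, with random tensors standing in for the CLIP encoders. All names and shapes are illustrative only; the actual model lives in strhub/vl_str.

import torch
import torch.nn as nn

# Toy stand-in for a branch decoder: maps features to character logits.
class ToyDecoder(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)  # (batch, seq_len, vocab_size)

dim, vocab_size, seq_len = 512, 97, 25
visual_decoder = ToyDecoder(dim, vocab_size)
cross_modal_decoder = ToyDecoder(dim, vocab_size)

image_feats = torch.randn(1, seq_len, dim)   # stub for CLIP image-encoder output
visual_logits = visual_decoder(image_feats)  # visual-branch prediction

# The cross-modal branch re-encodes the visual prediction with the
# (partially frozen) CLIP text encoder and fuses it with the image features.
text_feats = torch.randn(1, seq_len, dim)    # stub for CLIP text-encoder output
refined_logits = cross_modal_decoder(image_feats + text_feats)  # final output

print(visual_logits.shape, refined_logits.shape)  # torch.Size([1, 25, 97]) twice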

Installation

Prepare data

First of all, you need to download the STR dataset.

Generally, directories are organized as follows:

${ABSOLUTE_ROOT}
├── dataset
│   │
│   └── str_dataset           
│       ├── train
│       │   ├── real
│       │   └── synth
│       ├── val     
│       └── test
│
├── code              
│   │
│   └── clip4str
│
├── output (save the output of the program)
│
│
├── pretrained
│   └── clip (download the CLIP pre-trained weights and put them here)
│       └── ViT-B-16.pt
│
...
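A minimal sketch for creating this layout from ${ABSOLUTE_ROOT} follows. The weight URL is the ViT-B/16 entry published in OpenAI's CLIP repository (clip/clip.py); verify it before use.

mkdir -p dataset/str_dataset/train/real dataset/str_dataset/train/synth dataset/str_dataset/val dataset/str_dataset/test code output pretrained/clip
wget -P pretrained/clip https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt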

Dependency

Requires Python >= 3.8 and PyTorch >= 1.12. The following commands are tested on a Linux machine with CUDA Driver Version 525.105.17 and CUDA Version 11.3.

conda create --name clip4str python=3.8
conda activate clip4str
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
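As a quick sanity check (not part of the original instructions), confirm that a CUDA-enabled PyTorch build was installed:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"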

Results

CLIP4STR-B uses CLIP ViT-B/16 as the backbone, and CLIP4STR-L uses CLIP ViT-L/14. In the tables below, the number in parentheses after each benchmark name is the size of its test set; results are word-level recognition accuracy (%).

| Method | Train data | IIIT5K (3,000) | SVT (647) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288) | HOST (2,416) | WOST (2,416) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP4STR-B | MJ+ST | 97.70 | 95.36 | 96.06 | 87.47 | 84.02 | 91.47 | 94.44 | 80.01 | 86.75 |
| CLIP4STR-L | MJ+ST | 97.57 | 95.36 | 96.75 | 88.02 | 84.40 | 91.78 | 94.44 | 81.08 | 87.38 |
| CLIP4STR-B | Real(3.3M) | 99.20 | 98.30 | 98.23 | 91.44 | 90.61 | 96.90 | 99.65 | 77.36 | 87.87 |
| CLIP4STR-L | Real(3.3M) | 99.43 | 98.15 | 98.52 | 91.66 | 91.14 | 97.36 | 98.96 | 79.22 | 89.07 |

| Method | Train data | COCO (9,825) | ArT (35,149) | Uber (80,551) | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| CLIP4STR-B | MJ+ST | 66.69 | 72.82 | 43.52 | a5e3386222 |
| CLIP4STR-L | MJ+ST | 67.45 | 73.48 | 44.59 | 3544c362f0 |
| CLIP4STR-B | Real(3.3M) | 80.80 | 85.74 | 86.70 | d70bde1f2d |
| CLIP4STR-L | Real(3.3M) | 81.97 | 85.83 | 87.36 | f125500adc |

Training

  • Before training, set the paths properly. Find every /PUT/YOUR/PATH/HERE placeholder in configs, scripts, strhub/vl_str, and strhub/str_adapter (for example, in configs/main.yaml) and replace it with your own path. A global search-and-replace is recommended.

For CLIP4STR with CLIP-ViT-B, refer to

bash scripts/vl4str_base.sh

Training requires 8 NVIDIA GPUs with more than 24GB of memory each. If you have fewer GPUs, change trainer.gpus=A, trainer.accumulate_grad_batches=B, and model.batch_size=C in the bash scripts so that A * B * C = 1024 (the effective batch size). There is no need to modify the code; PyTorch Lightning handles the rest, as in the example below.
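For example, a hypothetical setting for a 4-GPU machine that keeps the effective batch size at 1024 (4 * 2 * 128 = 1024) would use the following overrides in scripts/vl4str_base.sh:

trainer.gpus=4 trainer.accumulate_grad_batches=2 model.batch_size=128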

For CLIP4STR with CLIP-ViT-L, refer to

bash scripts/vl4str_large.sh

We also provide the training script for the CLIP4STR + Adapter variant described in the original paper:

bash scripts/str_adapter.sh

Inference

bash scripts/test.sh {gpu_id} {subpath_of_ckpt}

For example,

bash scripts/test.sh 0 clip4str_base16x16_d70bde1f2d.ckpt

If you want to read characters from an image, try:

bash scripts/read.sh {gpu_id} {subpath_of_ckpt} {image_folder_path}

For example,

bash scripts/read.sh 0 clip4str_base16x16_d70bde1f2d.ckpt misc/test_images

Output:
image_1576.jpeg: Chicken

Citations

The BibTeX entry for CLIP4STR from DBLP is:

@article{DBLP:journals/corr/abs-2305-14014,
  author       = {Shuai Zhao and
                  Xiaohan Wang and
                  Linchao Zhu and
                  Yi Yang},
  title        = {{CLIP4STR:} {A} Simple Baseline for Scene Text Recognition with Pre-trained
                  Vision-Language Model},
  journal      = {CoRR},
  volume       = {abs/2305.14014},
  year         = {2023},
  url          = {https://doi.org/10.48550/arXiv.2305.14014},
  doi          = {10.48550/arXiv.2305.14014},
  eprinttype   = {arXiv},
  eprint       = {2305.14014},
  timestamp    = {Mon, 05 Jun 2023 15:42:15 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2305-14014.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Acknowledgements

This repo is built upon previous open-source works.


License

Apache License 2.0

