huyhoang17 / MATRN

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN).

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

| paper |

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN).

This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances.

An overview of MATRN. A visual feature extractor and an LM extract visual and semantic features, respectively. By utilizing the attention map, representing relations between visual features and character positions, MATRNs encode spatial information into the semantic features and hide visual features related to a randomly selected character. Through the multi-modal feature enhancement module, visual and semantic features interact with each other and the enhanced features in two modalities are fused to finalize the output sequence.

Datasets

We use lmdb dataset for training and evaluation dataset. The datasets can be downloaded in clova (for validation and evaluation) and ABINet (for training and evaluation).

  • Training datasets
  • Validation datasets
  • Evaluation datasets
  • Tree structure of data directory
    data
    ├── charset_36.txt
    ├── evaluation
    │   ├── CUTE80
    │   ├── IC13_857
    │   ├── IC13_1015
    │   ├── IC15_1811
    │   ├── IC15_2077
    │   ├── IIIT5k_3000
    │   ├── SVT
    │   └── SVTP
    ├── training
    │   ├── MJ
    │   │   ├── MJ_test
    │   │   ├── MJ_train
    │   │   └── MJ_valid
    │   └── ST
    ├── validation
    ├── WikiText-103.csv
    └── WikiText-103_eval_d1.csv
    

Requirements

pip install torch==1.7.1 torchvision==0.8.2 fastai==1.0.60 lmdb pillow opencv-python

Pretrained Models

  • Download pretrained model of MATRN from this link. Performances of the pretrained models are:
Model IIIT SVT IC13S IC13L IC15S IC15L SVTP CUTE
MATRN 96.7 94.9 97.9 95.8 86.6 82.9 90.5 94.1
  • If you want to train with pretrained visioan and language model, download pretrained model of vision and language model from ABINet.

Training and Evaluation

  • Training
python main.py --config=configs/train_matrn.yaml
  • Evaluation
python main.py --config=configs/train_matrn.yaml --phase test --image_only

Additional flags:

  • --checkpoint /path/to/checkpoint set the path of evaluation model
  • --test_root /path/to/dataset set the path of evaluation dataset
  • --model_eval [alignment|vision|language] which sub-model to evaluate
  • --image_only disable dumping visualization of attention masks

Acknowledgements

This implementation has been based on ABINet.

Citation

Please cite this work in your publications if it helps your research.

@article{na2021multi,
  title={Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features},
  author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae},
  journal={arXiv preprint arXiv:2111.15263},
  year={2021}
}

About

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN).

License:MIT License


Languages

Language:Python 99.2%Language:Dockerfile 0.8%