
Scene Text Recognition Resources

Author: Xiaoxue Chen (陈晓雪)

Updates

Dec 24, 2019: added 20 papers and updated the corresponding tables.

Feb 29, 2020: added AAAI-2020 papers and updated the corresponding tables. You can download the updated Excel file we prepared. (Password: bt74)


1. Datasets

1.1 Regular Latin Datasets

  • IIIT5K[31]:
    • Introduction: The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from originally-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Specifically, each lexicon consists of the ground-truth word and some randomly picked words (lexicon-constrained evaluation is sketched in the code example after this list).
    • Link: IIIT5K-download
  • SVT[1]:
    • Introduction: The SVT dataset contains 350 images: 100 for training and 250 for testing. Some images are severely corrupted by noise, blur, and low resolution. Besides, each image is associated with a 50-word lexicon.
    • Link: SVT-download
  • ICDAR 2003 (IC03)[33]:
    • Introduction: The IC03 dataset contains 509 images: 258 for training and 251 for testing. Specifically, it contains 867 cropped text instances after discarding images that contain non-alphanumeric characters or fewer than three characters. Every image is associated with a 50-word lexicon and a full lexicon. Moreover, the full lexicon combines the lexicon words of all images.
    • Link: IC03-download
  • ICDAR 2013 (IC13)[34]:
    • Introduction: The IC13 dataset contains 561 images: 420 for training and 141 for testing. It mostly inherits from the IC03 dataset and extends it with new images. Similar to the IC03 dataset, the IC13 dataset contains 1,015 cropped text instance images after removing words with non-alphanumeric characters. No lexicon is associated with IC13. Notably, 215 duplicate text instance images exist between the IC03 training set and the IC13 testing set, so researchers should be mindful of this overlap when evaluating a model on the IC13 testing data.
    • Link: IC13-download
  • COCO-Text[38]:
    • Introduction: The COCO-Text dataset contains 63,686 images with 145,859 cropped text instances. It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text. However, no lexicon is associated with COCO-Text.
    • Link: COCO-Text-download
  • SVHN[45]:
    • Introduction: The SVHN dataset contains more than 600,000 digits of house numbers in natural scenes. It was obtained from a large number of street view images using a combination of automated algorithms and the Amazon Mechanical Turk (AMT) framework. SVHN is typically used for scene digit recognition.
    • Link: SVHN-download
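
Lexicon-constrained evaluation on the datasets above is commonly implemented by snapping the raw model prediction to the lexicon word with the smallest edit distance. A minimal sketch of that protocol (the example words are illustrative only):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def constrain_to_lexicon(prediction: str, lexicon: list[str]) -> str:
    """Map a raw prediction to the closest lexicon word, e.g. against the
    50-word lexicon shipped with IIIT5K or SVT."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))

# Usage: constrain_to_lexicon("hcuse", ["house", "horse", "mouse"]) -> "house"
```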

1.2 Irregular Latin Datasets

  • SVT-P[35]:
    • Introduction: The SVT-P dataset contains 238 images with 639 cropped text instances. It is specifically designed to evaluate perspective-distorted text recognition. It was built from the original SVT dataset by selecting images at the same addresses on Google Street View but from different view angles. Therefore, most text instances are heavily distorted by the non-frontal view angle. Moreover, each image is associated with a 50-word lexicon and a full-word lexicon.
    • Link: SVT-P-download (Password : vnis)
  • CUTE80[36]:
    • Introduction: The CUTE80 dataset contains 80 high-resolution images with 288 cropped text instances. It focuses on curved text recognition. Most images in CUTE80 have a complex background, perspective distortion, and poor resolution. Besides, no lexicon is associated with CUTE80.
    • Link: CUTE80-download
  • ICDAR 2015 (IC15)[37]:
    • Introduction: The IC15 dataset contains 1,500 images: 1,000 for training and 500 for testing. Specifically, it contains 2,077 cropped text instances, including more than 200 irregular text samples. As the text images were taken with Google Glass without ensuring image quality, most of the text is very small, blurred, and multi-oriented. No lexicon is provided.
    • Link: IC15-download
  • Total-Text[39]:
    • Introduction: The Total-Text dataset contains 1,555 images with 11,459 cropped text instance images. It focuses on curved scene text recognition. Images in Total-Text have three different text orientations: horizontal, multi-oriented, and curved. No lexicon is associated with Total-Text.
    • Link: Total-Text-download

1.3 Multilingual Datasets

  • RCTW-17 (RCTW competition, ICDAR 2017)[40]:
    • Introduction: The RCTW-17 dataset contains 12,514 images: 11,514 for training and 1,000 for testing. Most are natural images collected by cameras or mobile phones, whereas others are digital-born. Text instances are annotated with labels, fonts, languages, etc.
    • Link: RCTW-17-download
  • MTWI (competition)[41]:
    • Introduction: The MTWI dataset contains 20,000 images. It is the first dataset constructed from Chinese and Latin web text. Most images in MTWI have relatively high resolution and cover diverse types of web text, including multi-oriented text, tightly stacked text, and complex-shaped text.
    • Link: MTWI-download (Password:gox9)
  • CTW[42]:
    • Introduction: The CTW dataset includes 32,285 high-resolution street view images with 1,018,402 character instances. All images have character-level annotations: the underlying character, the bounding box, and six other attributes.
    • Link: CTW-download
  • SCUT-CTW1500[43]:
    • Introduction: The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text instance images, including 3,530 with curved text. The images were manually harvested from the Internet, from image libraries such as Google Open-Image, or with phone cameras. The dataset also contains many horizontal and multi-oriented text instances.
    • Link: SCUT-CTW1500-download
  • LSVT (LSVT competition, ICDAR 2019)[57]:
    • Introduction: The LSVT dataset contains 20,000 testing samples, 30,000 fully annotated training samples, and 400,000 training samples with weak annotations (i.e., with partial labels). All images are captured from streets and reflect a large variety of complicated real-world scenarios, e.g., store fronts and landmarks.
    • Link: LSVT-download
  • ArT (ArT competition, ICDAR 2019)[58]:
    • Introduction: The ArT dataset contains 10,166 images: 5,603 for training and 4,563 for testing. ArT is a combination of Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, which was collected to introduce the arbitrary-shaped text problem to the scene text community. Moreover, all existing text shapes (i.e., horizontal, multi-oriented, and curved) have multiple occurrences in the ArT dataset.
    • Link: ArT-download
  • ReCTS-25k (ReCTS competition, ICDAR 2019)[59]:
    • Introduction: The ReCTS-25k dataset contains 25,000 images: 20,000 for training and 5,000 for testing. All text lines and characters are annotated with locations and transcriptions. All images are from the Meituan-Dianping Group, collected by Meituan business merchants using phone cameras under uncontrolled conditions. The ReCTS-25k dataset mainly focuses on reading Chinese text on signboards.
    • Link: ReCTS-download
  • MLT (MLT competition, ICDAR 2019)[81]:
    • Introduction: The MLT-2019 dataset contains 20,000 images: 10,000 for training (1,000 per language) and 10,000 for testing. The dataset includes ten languages (Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean), representing seven different scripts; the number of images per script is identical.
    • Link: MLT-download

1.4 Synthetic Datasets

  • Synth90k[53]:
    • Introduction: The Synth90k dataset contains 9 million synthetic text instance images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects, such as random fonts, colors, blur, and noise. The Synth90k dataset can emulate the distribution of scene text images and can be used instead of real-world data to train data-hungry deep learning algorithms (a toy rendering sketch follows this list). Besides, every image is annotated with a ground-truth word.
    • Link: Synth90k-download
  • SynthText[54]:
    • Introduction: The SynthText dataset contains 800,000 images with 6 million synthetic text instances. As with Synth90k, each text sample is rendered using a randomly selected font and is transformed according to the local surface orientation. Moreover, each image is annotated with a ground-truth word.
    • Link: SynthText-download
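
The generation recipe described above can be approximated in a few lines with Pillow: render a word in a random font size and color, then apply random rotation and blur. This is only a toy sketch under assumed defaults (the font file path and word list are placeholders; the real Synth90k/SynthText engines model surface geometry, lighting, and much more):

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def render_word(word: str, font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Render one synthetic text sample: random font size, colors,
    rotation, and Gaussian blur, loosely following the Synth90k recipe."""
    font = ImageFont.truetype(font_path, size=random.randint(24, 48))
    img = Image.new("RGB", (256, 64),
                    color=tuple(random.choices(range(256), k=3)))  # random background
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), word, font=font,
              fill=tuple(random.choices(range(256), k=3)))         # random text color
    img = img.rotate(random.uniform(-5, 5), expand=False)          # slight geometric distortion
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1.5)))

# samples = [render_word(w) for w in ["hello", "world"]]
```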

1.5 Comparison of the Benchmark Datasets

Comparison of the Benchmark Datasets

| Dataset | Language | Pictures (Total / Train / Test) | Instances (Total / Train / Test) | Lexicon | Type |
|---|---|---|---|---|---|
| IIIT5K[31] | English | 1120 / 380 / 740 | 5000 / 2000 / 3000 | 50, 1k | Regular |
| SVT[32] | English | 350 / 100 / 250 | 725 / 211 / 514 | 50 | Regular |
| IC03[33] | English | 509 / 258 / 251 | 2268 / 1157 / 1111 | 50, Full | Regular |
| IC13[34] | English | 561 / 420 / 141 | 5003 / 3564 / 1439 | None | Regular |
| COCO-Text[38] | English | 63686 / 43686 / 10000 | 145859 / 118309 / 27550 | None | Regular |
| SVHN[45] | Digits | 600000 / 573968 / 26032 | 600000 / 573968 / 26032 | None | Regular |
| SVT-P[35] | English | 238 / 0 / 238 | 639 / 0 / 639 | 50, Full | Irregular |
| CUTE80[36] | English | 80 / 0 / 80 | 288 / 0 / 288 | None | Irregular |
| IC15[37] | English | 1500 / 1000 / 500 | 6545 / 4468 / 2077 | None | Irregular |
| Total-Text[39] | English | 1555 / 1255 / 300 | 11459 / 11166 / 293 | None | Irregular |
| RCTW-17[40] | Chinese/English | 12514 / 11514 / 1000 | - / - / - | None | Regular |
| MTWI[41] | Chinese/English | 20000 / 10000 / 10000 | 290206 / 141476 / 148730 | None | Regular |
| CTW[42] | Chinese/English | 32285 / 25887 / 3269 | 1018402 / 812872 / 103519 | None | Regular |
| SCUT-CTW1500[43] | Chinese/English | 1500 / 1000 / 500 | 10751 / 7683 / 3068 | None | Irregular |
| LSVT[57], [63] | Chinese/English | 450000 / 30000 / 20000 | - / - / - | None | Irregular |
| ArT[58] | Chinese/English | 10166 / 5603 / 4563 | 98455 / 50029 / 48426 | None | Irregular |
| ReCTS-25k[59] | Chinese/English | 25000 / 20000 / 5000 | 119713 / 108924 / 10789 | None | Irregular |
| MLT[81] | Multilingual | 20000 / 10000 / 10000 | 191639 / 89177 / 102462 | None | Irregular |
| Synth90k[53] | English | ~9000000 / - / - | ~9000000 / - / - | None | Regular |
| SynthText[54] | English | ~800000 / - / - | ~6000000 / - / - | None | Regular |

2. Performance Comparison of Recognition Algorithms

2.1 Characteristics Comparison of Recognition Approaches

Note that methods marked with '*' use extra datasets other than Synth90k and SynthText. In the highlights, 'CTC' denotes decoding with a CTC-based algorithm and 'Attn' denotes decoding with an attention mechanism (a minimal CTC decoding sketch follows below).

You can also download the updated Excel file we prepared. (Password: bt74)
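
To make the CTC decoding mentioned in many of the highlights concrete: a CTC-based recognizer emits a per-frame distribution over characters plus a blank symbol, and the simplest decoder collapses repeated labels and then removes blanks. A minimal best-path sketch (real systems often use beam search or lexicon-constrained search; the toy charset and indices are illustrative only):

```python
def ctc_greedy_decode(frame_argmax: list[int], charset: str, blank: int = 0) -> str:
    """Collapse repeated labels, then drop blanks.
    `frame_argmax` holds the most likely class index per time step;
    index 0 is reserved for the CTC blank, index i > 0 maps to charset[i - 1]."""
    out, prev = [], None
    for idx in frame_argmax:
        if idx != prev and idx != blank:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)

# Frames "h h _ e _ l l _ l o" (with _ = blank = 0) collapse to "hello":
# ctc_greedy_decode([8, 8, 0, 5, 0, 12, 12, 0, 12, 15], "abcdefghijklmnopqrstuvwxyz")
```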

Characteristics Comparison of Recognition Approaches

| Method | Source | Time | Highlight |
|---|---|---|---|
| Wang et al. [1]: ABBYY | ICCV | 2011 | a state-of-the-art text detector + a leading commercial OCR engine |
| Wang et al. [1]: SYNTH+PLEX | ICCV | 2011 | the baseline of scene text recognition |
| Mishra et al. [2] | BMVC | 2012 | 1) incorporating higher-order statistical language models to recognize words in an unconstrained manner; 2) introducing the IIIT5K-word dataset |
| Wang et al. [3] | ICPR | 2012 | CNNs + non-maximal suppression + beam search |
| Goel et al. [4]: wDTW | ICDAR | 2013 | recognizing text by matching the scene and synthetic image features with wDTW |
| Bissacco et al. [5]: PhotoOCR | ICCV | 2013 | applying a network with five hidden layers for character classification |
| Phan et al. [6] | ICCV | 2013 | 1) MSER + SIFT descriptors + SVM; 2) introducing the SVT-P dataset |
| Alsharif et al. [7]: HMM/Maxout | ICLR | 2014 | convolutional Maxout networks + hybrid HMM |
| Almazan et al. [8]: KCSR | TPAMI | 2014 | embedding word images and text strings in a common vectorial subspace and casting recognition and retrieval as a nearest-neighbor problem |
| Yao et al. [9]: Strokelets | CVPR | 2014 | proposing a novel multi-scale representation for scene text recognition: strokelets |
| R.-Serrano et al. [10]: Label embedding | IJCV | 2015 | embedding word labels and word images into a common Euclidean space and finding the closest word label in this space |
| Jaderberg et al. [11] | ECCV | 2014 | 1) enabling efficient feature sharing for text detection and classification; 2) making technical changes over traditional CNN architectures; 3) proposing a method for automated data mining of Flickr |
| Su and Lu [12] | ACCV | 2014 | HOG + BLSTM + CTC |
| Gordo [13]: Mid-features | CVPR | 2015 | proposing local mid-level features for building word image representations |
| Jaderberg et al. [14] | IJCV | 2015 | 1) treating each word as a category and training very large convolutional neural networks to perform word recognition on the whole proposal region; 2) generating 9 million images with equal numbers of word samples from a 90k-word dictionary |
| Jaderberg et al. [15] | ICLR | 2015 | CNN + CRF |
| Shi, Bai, and Yao [16]: CRNN | TPAMI | 2017 | CNN + BLSTM + CTC |
| Shi et al. [17]: RARE | CVPR | 2016 | STN + CNN + attentional BLSTM |
| Lee and Osindero [18]: R2AM | CVPR | 2016 | presenting recursive recurrent neural networks with attention modeling |
| Liu et al. [19]: STAR-Net | BMVC | 2016 | STN + ResNet + BLSTM + CTC |
| Liu et al. [78] | ICPR | 2016 | integrating the CNN and WFST classification model |
| Mishra et al. [77] | CVIU | 2016 | character detection (HOG/CNN + SVM + sliding window) + CRF, combining bottom-up cues from character detection and top-down cues from the lexicon |
| Su and Lu [76] | PR | 2017 | HOG (different scales) + BLSTM + CTC (ensemble) |
| *Yang et al. [20] | IJCAI | 2017 | 1) CNN + 2D attention-based RNN, applying an auxiliary dense character detection task that helps to learn text-specific visual patterns; 2) developing a large-scale synthetic dataset |
| Yin et al. [21] | ICCV | 2017 | CNN + CTC |
| Wang et al. [66]: GRCNN | NIPS | 2017 | gated recurrent convolution layer + BLSTM + CTC |
| *Cheng et al. [22]: FAN | ICCV | 2017 | 1) proposing the concept of attention drift; 2) introducing a focusing network to focus deviated attention back on the target areas |
| Cheng et al. [23]: AON | CVPR | 2018 | 1) extracting scene text features in four directions; 2) CNN + attentional BLSTM |
| Gao et al. [24] | NC | 2019 | attentional ResNet + CNN + CTC |
| Liu et al. [25]: Char-Net | AAAI | 2018 | CNN + STN (facilitating the rectification of individual characters) + LSTM |
| *Liu et al. [26]: SqueezedText | AAAI | 2018 | binary convolutional encoder-decoder network + Bi-RNN |
| Zhan et al. [73] | ECCV | 2018 | CRNN, achieving verisimilar scene text image synthesis by combining three novel designs: semantic coherence, visual attention, and adaptive text appearance |
| *Bai et al. [27]: EP | CVPR | 2018 | proposing edit probability to effectively handle the misalignment between the training text and the output probability distribution sequence |
| Fang et al. [74] | MultiMedia | 2018 | ResNet + [2D attentional CNN, CNN-based language module] |
| Liu et al. [75]: EnEsCTC | NIPS | 2018 | proposing a novel maximum-entropy-based regularization for CTC (EnCTC) and an entropy-based pruning method (EsCTC) to effectively reduce the space of the feasible set |
| Liu et al. [28] | ECCV | 2018 | designing a multi-task network with an encoder-discriminator-generator architecture to guide the features of the original image toward those of the clean image |
| Wang et al. [61]: MAAN | ICFHR | 2018 | ResNet + BLSTM + memory-augmented attentional decoder |
| Gao et al. [29] | ICIP | 2018 | attentional DenseNet + BLSTM + CTC |
| Shi et al. [30]: ASTER | TPAMI | 2018 | TPS + ResNet + bidirectional attention-based BLSTM |
| Chen et al. [60]: ASTER + AEG | NC | 2019 | TPS + ResNet + bidirectional attention-based BLSTM + AEG |
| Luo et al. [46]: MORAN | PR | 2019 | multi-object rectification network + CNN + attentional BLSTM |
| Luo et al. [61]: MORAN-v2 | PR | 2019 | multi-object rectification network + ResNet + attentional BLSTM |
| Chen et al. [60]: MORAN-v2 + AEG | NC | 2019 | multi-object rectification network + ResNet + attentional BLSTM + AEG |
| Xie et al. [47]: CAN | ACM | 2019 | ResNet + CNN + GLU |
| *Liao et al. [48]: CA-FCN | AAAI | 2019 | performing character classification at each pixel location, requiring character-level annotations |
| *Li et al. [49]: SAR | AAAI | 2019 | ResNet + 2D attentional LSTM |
| Zhan et al. [55]: ESIR | CVPR | 2019 | iterative rectification network + ResNet + attentional BLSTM |
| Zhang et al. [56]: SSDAN | CVPR | 2019 | attentional CNN + GAS + GRU |
| Yang et al. [62]: ScRN | ICCV | 2019 | symmetry-constrained rectification network + ResNet + BLSTM + attentional GRU |
| Wang et al. [64]: GCAM | ICME | 2019 | convolutional block attention module (CBAM) + ResNet + BLSTM + the proposed gated cascade attention module (GCAM) |
| Baek et al. [65] | ICCV | 2019 | TPS + ResNet + BLSTM + attention mechanism |
| Huang et al. [67]: EPAN | NC | 2019 | learning to sample features from the text region of 2D feature maps and innovatively introducing a two-stage attention mechanism |
| Gao et al. [68] | NC | 2019 | attentional DenseNet + 4-layer CNN + CTC |
| Qi et al. [69]: CCL | ICDAR | 2019 | ResNet + [CTC, CCL] |
| Wang et al. [70]: ReELFA | ICDAR | 2019 | VGG + attentional LSTM, utilizing one-hot encoded coordinates to indicate the spatial relationship of pixels and character center masks to help focus attention on the right feature areas |
| Zhu et al. [71]: HATN | ICIP | 2019 | ResNet50 + hierarchical attention mechanism (Transformer structure) |
| Zhan et al. [72]: SF-GAN | CVPR | 2019 | ResNet50 + attentional decoder, synthesising realistic scene text images for training better recognition models |
| Liao et al. [79]: SAM | TPAMI | 2019 | spatial attention module (SAM) |
| *Liao et al. [79]: seg-SAM | TPAMI | 2019 | character segmentation module + spatial attention module (SAM) |
| Wang et al. [80]: DAN | AAAI | 2020 | decoupling the decoder of the traditional attention mechanism into a convolutional alignment module and a decoupled text decoder |
| Wang et al. [82]: TextSR | arXiv | 2019 | attempting to solve small text with super-resolution methods |
| Wan et al. [83]: TextScanner | AAAI | 2020 | an effective segmentation-based dual-branch framework for scene text recognition |
| Hu et al. [84]: GTC | AAAI | 2020 | attempting to use GCN to learn the local correlations of feature sequences |
| Luo et al. [85] | arXiv | 2020 | separating text content from noisy background styles |

2.2 Performance Comparison on Benchmark Datasets

In this section, we compare the performance of current advanced algorithms on benchmark datasets, including IIIT5K, SVT, IC03, IC13, SVT-P, CUTE80, IC15, RCTW-17, MTWI, CTW, SCUT-CTW1500, LSVT, ArT, and ReCTS-25k.

It is notable that 1) '*' indicates methods that use extra datasets other than Synth90k and SynthText; 2) bold numbers represent the best recognition results; 3) '^' denotes the best recognition results among methods using extra datasets; 4) '@' represents methods evaluated under a different protocol that uses only 1,811 test images; 5) 'SK', 'ST', 'ExPu', 'ExPr', and 'Un' indicate methods that use Synth90k, SynthText, extra public data, extra private data, and unknown data, respectively; 6) 'D_A' means data augmentation.
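
For the lexicon-free 'None' columns in the tables below, the reported number is word recognition accuracy: the percentage of test images whose predicted string exactly matches the ground truth, commonly compared case-insensitively over alphanumeric characters. A minimal sketch of that convention (normalization details vary slightly across papers):

```python
import re

def word_accuracy(preds: list[str], gts: list[str]) -> float:
    """Case-insensitive word accuracy over alphanumeric content,
    the common protocol for lexicon-free evaluation."""
    norm = lambda s: re.sub(r"[^0-9a-z]", "", s.lower())
    correct = sum(norm(p) == norm(g) for p, g in zip(preds, gts))
    return 100.0 * correct / len(gts)

# word_accuracy(["Hello!", "worl d"], ["hello", "world"]) -> 100.0
```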

2.2.1 Performance Comparison of Recognition Algorithms on Regular Latin Datasets

Performance Comparison of Recognition Algorithms on Regular Latin Datasets
| Method | IIIT5K (50) | IIIT5K (1k) | IIIT5K (None) | SVT (50) | SVT (None) | IC03 (50) | IC03 (Full) | IC03 (50k) | IC03 (None) | IC13 (None) | Data | Source | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wang et al. [1]: ABBYY | 24.3 | - | - | 35 | - | 56 | 55 | - | - | - | Un | ICCV | 2011 |
| Wang et al. [1]: SYNTH+PLEX | - | - | - | 57 | - | 76 | 62 | - | - | - | ExPr | ICCV | 2011 |
| Mishra et al. [2] | 64.1 | 57.5 | - | 73.2 | - | 81.8 | 67.8 | - | - | - | ExPu | BMVC | 2012 |
| Wang et al. [3] | - | - | - | 70 | - | 90 | 84 | - | - | - | ExPr | ICPR | 2012 |
| Goel et al. [4]: wDTW | - | - | - | 77.3 | - | 89.7 | - | - | - | - | Un | ICDAR | 2013 |
| Bissacco et al. [5]: PhotoOCR | - | - | - | 90.4 | 78 | - | - | - | - | 87.6 | ExPr | ICCV | 2013 |
| Phan et al. [6] | - | - | - | 73.7 | - | 82.2 | - | - | - | - | ExPu | ICCV | 2013 |
| Alsharif et al. [7]: HMM/Maxout | - | - | - | 74.3 | - | 93.1 | 88.6 | 85.1 | - | - | ExPu | ICLR | 2014 |
| Almazan et al. [8]: KCSR | 88.6 | 75.6 | - | 87 | - | - | - | - | - | - | ExPu | TPAMI | 2014 |
| Yao et al. [9]: Strokelets | 80.2 | 69.3 | - | 75.9 | - | 88.5 | 80.3 | - | - | - | ExPu | CVPR | 2014 |
| R.-Serrano et al. [10]: Label embedding | 76.1 | 57.4 | - | 70 | - | - | - | - | - | - | ExPu | IJCV | 2015 |
| Jaderberg et al. [11] | - | - | - | 86.1 | - | 96.2 | 91.5 | - | - | - | ExPu | ECCV | 2014 |
| Su and Lu [12] | - | - | - | 83 | - | 92 | 82 | - | - | - | ExPu | ACCV | 2014 |
| Gordo [13]: Mid-features | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - | ExPu | CVPR | 2015 |
| Jaderberg et al. [14] | 97.1 | 92.7 | - | 95.4 | 80.7 | 98.7 | 98.6 | 93.3 | 93.1 | 90.8 | ExPr | IJCV | 2015 |
| Jaderberg et al. [15] | 95.5 | 89.6 | - | 93.2 | 71.7 | 97.8 | 97 | 93.4 | 89.6 | 81.8 | SK + ExPr | ICLR | 2015 |
| Shi, Bai, and Yao [16]: CRNN | 97.8 | 95 | 81.2 | 97.5 | 82.7 | 98.7 | 98 | 95.7 | 91.9 | 89.6 | SK | TPAMI | 2017 |
| Shi et al. [17]: RARE | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 98.3 | 96.2 | 94.8 | 90.1 | 88.6 | SK | CVPR | 2016 |
| Lee and Osindero [18]: R2AM | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 97.9 | 97 | - | 88.7 | 90 | SK | CVPR | 2016 |
| Liu et al. [19]: STAR-Net | 97.7 | 94.5 | 83.3 | 95.5 | 83.6 | 96.9 | 95.3 | - | 89.9 | 89.1 | SK + ExPr | BMVC | 2016 |
| *Liu et al. [78] | 94.1 | 84.7 | - | 92.5 | - | 96.8 | 92.2 | - | - | - | ExPu (D_A) | ICPR | 2016 |
| *Mishra et al. [77] | 78.07 | - | 46.73 | 78.2 | - | 88 | - | - | 67.7 | 60.18 | ExPu (D_A) | CVIU | 2016 |
| *Su and Lu [76] | - | - | - | 91 | - | 95 | 89 | - | - | 76 | SK + ExPu | PR | 2017 |
| *Yang et al. [20] | 97.8 | 96.1 | - | 95.2 | - | 97.7 | - | - | - | - | ExPu | IJCAI | 2017 |
| Yin et al. [21] | 98.7 | 96.1 | 78.2 | 95.1 | 72.5 | 97.6 | 96.5 | - | 81.1 | 81.4 | SK | ICCV | 2017 |
| Wang et al. [66]: GRCNN | 98 | 95.6 | 80.8 | 96.3 | 81.5 | 98.8 | 97.8 | - | 91.2 | - | SK | NIPS | 2017 |
| *Cheng et al. [22]: FAN | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 99.2 | 97.3 | - | 94.2 | 93.3 | SK + ST (Pixel_wise) | ICCV | 2017 |
| Cheng et al. [23]: AON | 99.6 | 98.1 | 87 | 96 | 82.8 | 98.5 | 97.1 | - | 91.5 | - | SK + ST (D_A) | CVPR | 2018 |
| Gao et al. [24] | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88 | SK | NC | 2019 |
| Liu et al. [25]: Char-Net | - | - | 83.6 | - | 84.4 | - | 93.3 | - | 91.5 | 90.8 | SK (D_A) | AAAI | 2018 |
| *Liu et al. [26]: SqueezedText | 97 | 94.1 | 87 | 95.2 | - | 98.8 | 97.9 | 93.8 | 93.1 | 92.9 | ExPr | AAAI | 2018 |
| *Zhan et al. [73] | 98.1 | 95.3 | 79.3 | 96.7 | 81.5 | - | - | - | - | 87.1 | Pr (5 million) | ECCV | 2018 |
| *Bai et al. [27]: EP | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 98.7 | 97.9 | - | 94.6 | 94.4 | SK + ST (Pixel_wise) | CVPR | 2018 |
| Fang et al. [74] | 98.5 | 96.8 | 86.7 | 97.8 | 86.7 | 99.3 | 98.4 | - | 94.8 | 93.5 | SK + ST | MultiMedia | 2018 |
| Liu et al. [75]: EnEsCTC | - | - | 82 | - | 80.6 | - | - | - | 92 | 90.6 | SK | NIPS | 2018 |
| Liu et al. [28] | 97.3 | 96.1 | 89.4 | 96.8 | 87.1 | 98.1 | 97.5 | - | 94.7 | 94 | SK | ECCV | 2018 |
| Wang et al. [61]: MAAN | 98.3 | 96.4 | 84.1 | 96.4 | 83.5 | 97.4 | 96.4 | - | 92.2 | 91.1 | SK | ICFHR | 2018 |
| Gao et al. [29] | 99.1 | 97.2 | 83.6 | 97.7 | 83.9 | 98.6 | 96.6 | - | 91.4 | 89.5 | SK | ICIP | 2018 |
| Shi et al. [30]: ASTER | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98 | - | 94.5 | 91.8 | SK + ST | TPAMI | 2018 |
| Chen et al. [60]: ASTER + AEG | 99.5 | 98.5 | 94.4 | 97.4 | 90.3 | 99 | 98.3 | - | 95.2 | 95 | SK + ST | NC | 2019 |
| Luo et al. [46]: MORAN | 97.9 | 96.2 | 91.2 | 96.6 | 88.3 | 98.7 | 97.8 | - | 95 | 92.4 | SK + ST | PR | 2019 |
| Luo et al. [61]: MORAN-v2 | - | - | 93.4 | - | 88.3 | - | - | - | 94.2 | 93.2 | SK + ST | PR | 2019 |
| Chen et al. [60]: MORAN-v2 + AEG | 99.5 | 98.7 | 94.6 | 97.4 | 90.4 | 98.8 | 98.3 | - | 95.3 | 95.3 | SK + ST | NC | 2019 |
| Xie et al. [47]: CAN | 97 | 94.2 | 80.5 | 96.9 | 83.4 | 98.4 | 97.8 | - | 91 | 90.5 | SK | ACM | 2019 |
| *Liao et al. [48]: CA-FCN | ^99.8 | 98.9 | 92 | 98.8 | 82.1 | - | - | - | - | 91.4 | SK + ST + ExPr | AAAI | 2019 |
| *Li et al. [49]: SAR | 99.4 | 98.2 | 95 | 98.5 | 91.2 | - | - | - | - | 94 | SK + ST + ExPr | AAAI | 2019 |
| Zhan et al. [55]: ESIR | 99.6 | 98.8 | 93.3 | 97.4 | 90.2 | - | - | - | - | 91.3 | SK + ST | CVPR | 2019 |
| Zhang et al. [56]: SSDAN | - | - | 83.8 | - | 84.5 | - | - | - | 92.1 | 91.8 | SK | CVPR | 2019 |
| *Yang et al. [62]: ScRN | 99.5 | 98.8 | 94.4 | 97.2 | 88.9 | 99 | 98.3 | - | 95 | 93.9 | SK + ST (char-level + word-level) | ICCV | 2019 |
| Wang et al. [64]: GCAM | - | - | 93.9 | - | 91.3 | - | - | - | 95.3 | 95.7 | SK + ST | ICME | 2019 |
| Baek et al. [65] | - | - | 87.9 | - | 87.5 | - | - | - | 94.4 | 92.3 | SK + ST | ICCV | 2019 |
| Huang et al. [67]: EPAN | 98.9 | 97.8 | 94 | 96.6 | 88.9 | 98.7 | 98 | - | 95 | 94.5 | SK + ST | NC | 2019 |
| Gao et al. [68] | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88 | SK | NC | 2019 |
| *Qi et al. [69]: CCL | 99.6 | 99.1 | 91.1 | 98 | 85.9 | 99.2 | ^98.8 | - | 93.5 | 92.8 | SK + ST (char-level + word-level) | ICDAR | 2019 |
| *Wang et al. [70]: ReELFA | 99.2 | 98.1 | 90.9 | - | 82.7 | - | - | - | - | - | ST (char-level + word-level) | ICDAR | 2019 |
| *Zhu et al. [71]: HATN | - | - | 88.6 | - | 82.2 | - | - | - | 91.3 | 91.1 | SK (D_A) + Pu | ICIP | 2019 |
| *Zhan et al. [72]: SF-GAN | - | - | 63 | - | 69.3 | - | - | - | - | 61.8 | Pr (1 million) | CVPR | 2019 |
| Liao et al. [79]: SAM | 99.4 | 98.6 | 93.9 | 98.6 | 90.6 | 98.8 | 98 | - | 95.2 | 95.3 | SK + ST | TPAMI | 2019 |
| *Liao et al. [79]: seg-SAM | ^99.8 | ^99.3 | 95.3 | ^99.1 | 91.8 | 99 | 97.9 | - | 95 | 95.3 | SK + ST (char-level) | TPAMI | 2019 |
| Wang et al. [80]: DAN | - | - | 94.3 | - | 89.2 | - | - | - | 95 | 93.9 | SK + ST | AAAI | 2020 |
| Wang et al. [82]: TextSR | - | - | 92.5 | 98 | 87.2 | - | - | - | 93.2 | 91.3 | SK + ST | arXiv | 2019 |
| *Wan et al. [83]: TextScanner | 99.7 | 99.1 | 93.9 | 98.5 | 90.1 | - | - | - | - | 92.9 | SK + ST (char-level) | AAAI | 2020 |
| *Hu et al. [84]: GTC | - | - | ^95.8 | - | ^92.9 | - | - | - | 95.5 | 94.4 | SK + ST + ExPu | AAAI | 2020 |
| Luo et al. [85] | 99.6 | 98.7 | 95.4 | 98.9 | 92.7 | 99.1 | 98.8 | - | 96.3 | 94.8 | SK + ST | arXiv | 2020 |

2.2.2 Performance Comparison of Recognition Algorithms on Irregular Latin Datasets

Performance Comparison of Recognition Algorithms on Irregular Latin Datasets
| Method | SVT-P (50) | SVT-P (Full) | SVT-P (None) | CUTE80 (None) | IC15 (None) | COCO-Text (None) | Data | Source | Time |
|---|---|---|---|---|---|---|---|---|---|
| Wang et al. [1]: ABBYY | 40.5 | 26.1 | - | - | - | - | Un | ICCV | 2011 |
| Wang et al. [1]: SYNTH+PLEX | - | - | - | - | - | - | ExPr | ICCV | 2011 |
| Mishra et al. [2] | 45.7 | 24.7 | - | - | - | - | ExPu | BMVC | 2012 |
| Wang et al. [3] | 40.2 | 32.4 | - | - | - | - | ExPr | ICPR | 2012 |
| Goel et al. [4]: wDTW | - | - | - | - | - | - | Un | ICDAR | 2013 |
| Bissacco et al. [5]: PhotoOCR | - | - | - | - | - | - | ExPr | ICCV | 2013 |
| Phan et al. [6] | 62.3 | 42.2 | - | - | - | - | ExPu | ICCV | 2013 |
| Alsharif et al. [7]: HMM/Maxout | - | - | - | - | - | - | ExPu | ICLR | 2014 |
| Almazan et al. [8]: KCSR | - | - | - | - | - | - | ExPu | TPAMI | 2014 |
| Yao et al. [9]: Strokelets | - | - | - | - | - | - | ExPu | CVPR | 2014 |
| R.-Serrano et al. [10]: Label embedding | - | - | - | - | - | - | ExPu | IJCV | 2015 |
| Jaderberg et al. [11] | - | - | - | - | - | - | ExPu | ECCV | 2014 |
| Su and Lu [12] | - | - | - | - | - | - | ExPu | ACCV | 2014 |
| Gordo [13]: Mid-features | - | - | - | - | - | - | ExPu | CVPR | 2015 |
| Jaderberg et al. [14] | - | - | - | - | - | - | ExPr | IJCV | 2015 |
| Jaderberg et al. [15] | - | - | - | - | - | - | SK + ExPr | ICLR | 2015 |
| Shi, Bai, and Yao [16]: CRNN | - | - | - | - | - | - | SK | TPAMI | 2017 |
| Shi et al. [17]: RARE | 91.2 | 77.4 | 71.8 | 59.2 | - | - | SK | CVPR | 2016 |
| Lee and Osindero [18]: R2AM | - | - | - | - | - | - | SK | CVPR | 2016 |
| Liu et al. [19]: STAR-Net | 94.3 | 83.6 | 73.5 | - | - | - | SK + ExPr | BMVC | 2016 |
| *Liu et al. [78] | - | - | - | - | - | - | ExPu (D_A) | ICPR | 2016 |
| *Mishra et al. [77] | - | - | - | - | - | - | ExPu (D_A) | CVIU | 2016 |
| *Su and Lu [76] | - | - | - | - | - | - | SK + ExPu | PR | 2017 |
| *Yang et al. [20] | 93 | 80.2 | 75.8 | 69.3 | - | - | ExPu | IJCAI | 2017 |
| Yin et al. [21] | - | - | - | - | - | - | SK | ICCV | 2017 |
| Wang et al. [66]: GRCNN | - | - | - | - | - | - | SK | NIPS | 2017 |
| *Cheng et al. [22]: FAN | - | - | - | - | @85.3 | - | SK + ST (Pixel_wise) | ICCV | 2017 |
| Cheng et al. [23]: AON | 94 | 83.7 | 73 | 76.8 | 68.2 | - | SK + ST (D_A) | CVPR | 2018 |
| Gao et al. [24] | - | - | - | - | - | - | SK | NC | 2019 |
| Liu et al. [25]: Char-Net | - | - | 73.5 | - | 60 | - | SK (D_A) | AAAI | 2018 |
| *Liu et al. [26]: SqueezedText | - | - | - | - | - | - | ExPr | AAAI | 2018 |
| *Zhan et al. [73] | - | - | - | - | - | - | Pr (5 million) | ECCV | 2018 |
| *Bai et al. [27]: EP | - | - | - | - | 73.9 | - | SK + ST (Pixel_wise) | CVPR | 2018 |
| Fang et al. [74] | - | - | - | - | 71.2 | - | SK + ST | MultiMedia | 2018 |
| Liu et al. [75]: EnEsCTC | - | - | - | - | - | - | SK | NIPS | 2018 |
| Liu et al. [28] | - | - | 73.9 | 62.5 | - | - | SK | ECCV | 2018 |
| Wang et al. [61]: MAAN | - | - | - | - | - | - | SK | ICFHR | 2018 |
| Gao et al. [29] | - | - | - | - | - | - | SK | ICIP | 2018 |
| Shi et al. [30]: ASTER | - | - | 78.5 | 79.5 | 76.1 | - | SK + ST | TPAMI | 2018 |
| Chen et al. [60]: ASTER + AEG | 94.4 | 89.5 | 82 | 80.9 | 76.7 | - | SK + ST | NC | 2019 |
| Luo et al. [46]: MORAN | 94.3 | 86.7 | 76.1 | 77.4 | 68.8 | - | SK + ST | PR | 2019 |
| Luo et al. [61]: MORAN-v2 | - | - | 79.7 | 81.9 | 73.9 | - | SK + ST | PR | 2019 |
| Chen et al. [60]: MORAN-v2 + AEG | 94.7 | 89.6 | 82.8 | 81.3 | 77.4 | - | SK + ST | NC | 2019 |
| Xie et al. [47]: CAN | - | - | - | - | - | - | SK | ACM | 2019 |
| *Liao et al. [48]: CA-FCN | - | - | - | 78.1 | - | - | SK + ST + ExPr | AAAI | 2019 |
| *Li et al. [49]: SAR | ^95.8 | 91.2 | ^86.4 | 89.6 | 78.8 | ^66.8 | SK + ST + ExPr | AAAI | 2019 |
| Zhan et al. [55]: ESIR | - | - | 79.6 | 83.3 | 76.9 | - | SK + ST | CVPR | 2019 |
| Zhang et al. [56]: SSDAN | - | - | - | - | - | - | SK | CVPR | 2019 |
| *Yang et al. [62]: ScRN | - | - | 80.8 | 87.5 | 78.7 | - | SK + ST (char-level + word-level) | ICCV | 2019 |
| Wang et al. [64]: GCAM | - | - | 85.7 | 83.3 | 83.5 | - | SK + ST | ICME | 2019 |
| Baek et al. [65] | - | - | 79.2 | 74 | 71.8 | - | SK + ST | ICCV | 2019 |
| Huang et al. [67]: EPAN | 91.2 | 86.4 | 79.4 | 82.6 | 73.9 | - | SK + ST | NC | 2019 |
| Gao et al. [68] | - | - | - | - | 62.3 | 40 | SK | NC | 2019 |
| *Qi et al. [69]: CCL | - | - | - | - | 72.9 | - | SK + ST (char-level + word-level) | ICDAR | 2019 |
| *Wang et al. [70]: ReELFA | - | - | - | 82.3 | 68.5 | - | ST (char-level + word-level) | ICDAR | 2019 |
| *Zhu et al. [71]: HATN | - | - | 73.5 | 75.7 | 70.1 | - | SK (D_A) + Pu | ICIP | 2019 |
| *Zhan et al. [72]: SF-GAN | - | - | 48.6 | 40.6 | 39 | - | Pr (1 million) | CVPR | 2019 |
| Liao et al. [79]: SAM | - | - | 82.2 | 87.8 | 77.3 | - | SK + ST | TPAMI | 2019 |
| *Liao et al. [79]: seg-SAM | - | - | 83.6 | 88.5 | 78.2 | - | SK + ST (char-level) | TPAMI | 2019 |
| Wang et al. [80]: DAN | - | - | 80 | 84.4 | 74.5 | - | SK + ST | AAAI | 2020 |
| Wang et al. [82]: TextSR | - | - | 77.4 | 78.9 | 75.6 | - | SK + ST | arXiv | 2019 |
| *Wan et al. [83]: TextScanner | - | - | 84.3 | 83.3 | 79.4 | - | SK + ST (char-level) | AAAI | 2020 |
| *Hu et al. [84]: GTC | - | - | 85.7 | ^92.2 | 79.5 | - | SK + ST + ExPu | AAAI | 2020 |
| Luo et al. [85] | 95.5 | 92.2 | 85.4 | 89.6 | 83.7 | - | SK + ST | arXiv | 2020 |

2.2.3 Performance Comparison of Recognition Algorithms on Multilingual Datasets

In this section, we only list the top-ranked entry of each competition; please refer to the competition websites for more information. (The 1-NED protocol used by some entries is sketched below the table.)

Performance Comparison of Recognition Algorithms on Multilingual Datasets
| Competition | Detection Team | Detection Protocol | Detection Result | End-to-End Team | End-to-End Protocol | End-to-End Result |
|---|---|---|---|---|---|---|
| RCTW | Foo & Bar | F-score | 66.1 | NLPR_PAL | 1-NED | 67.99 |
| MTWI | nelslip(iflytek&ustc) | F-score | 79.6 | nelslip(iflytek&ustc) | F-score | 81.5 |
| LSVT | Tencent-DPPR Team | F-score | 86.42 | Tencent-DPPR Team | F-score | 60.97 |
| ArT | pil_maskrcnn | F-score | 82.65 | baseline_0.5_class_5435 | F-score | 50.17 |
| ReCTS | SANHL | F-score | 93.36 | Tencent-DPPR | 1-NED | 81.5 |
| MLT | Tencent-DPPR Team | F-score | 83.61 | Tencent-DPPR Team & USTB-PRIR | F-score | 59.15 |
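
The '1-NED' protocol in the table above averages, over all test samples, one minus the edit distance between prediction and ground truth normalized by the length of the longer string (higher is better). A minimal sketch under that definition:

```python
def normalized_edit_similarity(pred: str, gt: str) -> float:
    """1 - NED for a single sample: Levenshtein distance divided by
    the length of the longer string, subtracted from one."""
    prev = list(range(len(gt) + 1))
    for i, cp in enumerate(pred, 1):
        cur = [i]
        for j, cg in enumerate(gt, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cp != cg)))
        prev = cur
    dist = prev[-1]
    return 1.0 - dist / max(len(pred), len(gt), 1)

def one_minus_ned(preds: list[str], gts: list[str]) -> float:
    """Dataset-level 1-NED, averaged over all samples."""
    return sum(map(normalized_edit_similarity, preds, gts)) / len(gts)
```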

3. Survey

[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper


4. OCR Service

  • Tesseract OCR Engine
  • Azure
  • ABBYY
  • OCR Space
  • SODA PDF OCR
  • Free Online OCR
  • Online OCR
  • Super Tools
  • Online Chinese Recognition
  • Calamari OCR
  • Tencent OCR
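
As an example of driving the first engine listed above from Python, the pytesseract wrapper exposes Tesseract with a one-line call (assuming both Tesseract and pytesseract are installed; the image path is a placeholder):

```python
from PIL import Image
import pytesseract  # pip install pytesseract; requires a local Tesseract install

# Recognize the text in a cropped scene-text image.
text = pytesseract.image_to_string(Image.open("word_crop.png"))
print(text.strip())
```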

5. References

[1] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proceedings of International Conference on Computer Vision (ICCV), pages 1457–1464, 2011. paper

[2] [BMVC-2012] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In Proceedings of British Machine Vision Conference (BMVC), pages 1–11, 2012. paper dataset

[3] [ICPR-2012] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of International Conference on Pattern Recognition (ICPR), pages 3304–3308, 2012. paper

[4] [ICDAR-2013] V. Goel, A. Mishra, K. Alahari, and C. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 398–402, 2013. paper

[5] [ICCV-2013] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of International Conference on Computer Vision (ICCV), pages 785–792, 2013. paper

[6] [ICCV-2013] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of International Conference on Computer Vision (ICCV), pages 569–576, 2013. paper

[7] [ICLR-2014] O. Alsharif and J. Pineau, End-to-end text recognition with hybrid HMM maxout models. In Proceedings of International Conference on Learning Representations (ICLR), 2014. paper

[8] [TPAMI-2014] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell., 36(12):2552–2566, 2014. paper code

[9] [CVPR-2014] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014. paper

[10] [IJCV-2015] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision (IJCV) , 113(3):193–207, 2015. paper

[11] [ECCV-2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528, 2014. paper code

[12] [ACCV-2014] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 35–48, 2014. paper

[13] [CVPR-2015] A. Gordo. Supervised mid-level features for word image representation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2956–2964, 2015. paper

[14] [IJCV-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J.Comput. Vision, 2015. paper code

[15] [ICLR-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Deep structured output learning for unconstrained text recognition. In Proceedings of International Conference on Learning Representations (ICLR), 2015. paper

[16] [TPAMI-2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2017. paper code-Torch7 code-Pytorch

[17] [CVPR-2016] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016. paper

[18] [CVPR-2016] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016. paper

[19] [BMVC-2016] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. STAR-Net: A spatial attention residue network for scene text recognition. In Proceedings of British Machine Vision Conference (BMVC), page 7, 2016. paper

[20] [IJCAI-2017] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 2017. paper

[21] [ICCV-2017] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu. Scene text recognition with sliding convolutional character models. In Proceedings of International Conference on Computer Vision (ICCV), 2017. paper code

[22] [ICCV-2017] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of International Conference on Computer Vision (ICCV), pages 5086–5094, 2017. paper

[23] [CVPR-2018] Cheng Z, Xu Y, Bai F, et al. AON: Towards Arbitrarily-Oriented Text Recognition.In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 5571-5579, 2018. paper code

[24] [NC-2019] Gao Y, Chen Y, Wang J, et al. Reading scene text with fully convolutional sequence modeling[J]. Neurocomputing, 2019, 339: 161-170. paper

[25] [AAAI-2018] Liu W, Chen C, Wong K Y K. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI) 2018. paper

[26] [AAAI-2018] Liu Z, Li Y, Ren F, et al. SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI) 2018. paper

[27] [CVPR-2018] Bai, F, Cheng, Z, Niu, Y, Pu, S and Zhou,S. Edit probability for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1508–1516. paper

[28] [ECCV-2018] Liu Y, Wang Z, Jin H, et al. Synthetically Supervised Feature Learning for Scene Text Recognition. In Proceedings of the European Conference on Computer Vision (ECCV). 2018: 435-451. paper

[29] [ICIP-2018] Gao Y, Chen Y, Wang J, et al. Dense Chained Attention Network for Scene Text Recognition. In Proceedings of International Conference on Image Processing (ICIP). IEEE, 2018: 679-683. paper

[30] [TPAMI-2018] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification[J]. IEEE transactions on pattern analysis and machine intelligence, 2018. paper code

[31] [CVPR-2012] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2687–2694. paper

[32] https://github.com/Canjie-Luo/MORAN_v2

[33] [ICDAR-2003] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, ICDAR 2003 robust reading competitions. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 682–687. paper

[34] [ICDAR-2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), 2013, pp. 1484–1493. paper

[35] [ICCV-2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of International Conference on Computer Vision (ICCV), 2013. paper

[36] [Expert Syst.Appl-2014] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014. paper

[37] [ICDAR-2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al. ICDAR 2015 competition on robust reading. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 1156–1160. paper

[38] [arXiv-2016] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “Coco-text: Dataset and benchmark for text detection and recognition in natural images,” CoRR abs/1601.07140, 2016. paper code

[39] [ICDAR-2017] C. K. Ch’ng and C. S. Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceeding of International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 935–942. paper code

[40] [ICDAR-2017] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. ICDAR 2017 competition on reading chinese text in the wild (RCTW-17). In Proceeding of International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1429–1434. paper

[41] [ICPR-2018] M. He, Y. Liu, Z. Yang, S. Zhang, C. Luo, F. Gao, Q. Zheng, Y. Wang, X. Zhang, and L. Jin. ICPR 2018 contest on robust reading for multi-type web images. In Proceedings of International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 7–12. paper

[42] [JCS&T-2019] Yuan T L, Zhu Z, Xu K, et al. A large chinese text dataset in the wild[J]. Journal of Computer Science and Technology, 2019, 34(3): 509-521. paper code

[43] [arXiv-2017] L. Yuliang, J. Lianwen, Z. Shuaitao, and Z. Sheng, “Detecting curve text in the wild: New dataset and new solution,” CoRR abs/1712.02170, 2017. paper code

[44] [ECCV-2018] Yao C, Wu W. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes. In Proceedings of the European Conference on Computer Vision (ECCV). 2018: 71-88. paper code

[45] [NIPS-W-2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. paper

[46] [PR-2019] C. Luo, L. Jin, and Z. Sun, “MORAN: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019. paper code

[47] [ACM-2019] Xie H, Fang S, Zha Z J, et al, “Convolutional Attention Networks for Scene Text Recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, pp. 3 2019. paper

[48] [AAAI-2019] Liao M, Zhang J, Wan Z, et al. Scene text recognition from two-dimensional perspective. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI). 2019. paper

[49] [AAAI-2019] Li H, Wang P, Shen C, et al. Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI) 2019. paper code

[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[52] [arXiv-2018] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” CoRR abs/1811.04256, 2018. paper

[53] [NIPS-W-2014] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, In Proceedings of Advances in Neural Information Processing Deep Learn. Workshop (NIPS-W).2014. paper code

[54] [CVPR-2016] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2315–2324. paper code

[55] [CVPR-2019] Zhan F, Lu S. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2059-2068. paper

[56] [CVPR-2019] Zhang Y, Nie S, Liu W, et al. Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2740-2749. paper code

[57][ICDAR-2019] Y. Sun, Z. Ni, C.-K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas et al., “ICDAR 2019 competition on large-scale street view text with partial labeling–RRC-LSVT,” in Proceedings of ICDAR, 2019, pp. 1557–1562. paper Link

[58][ICDAR-2019] C.-K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding et al., “ICDAR2019 robust reading challenge on arbitrary-shaped text (RRC-ArT),” in Proceedings of ICDAR, 2019, pp. 1571–1576. paper Link

[59][ICDAR-2019] X. Liu, R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao et al., “ICDAR 2019 robust reading challenge on reading chinese text on signboard,” in Proceedings of ICDAR, 2019, pp. 1577–1581. paper Link

[60] [NC-2019] X. Chen, T. Wang, Y. Zhu, L. Jin, and C. Luo, “Adaptive embedding gate for attention-based scene text recognition,” Neurocomputing, 2019. paper


Newly added references (Dec 24, 2019)

[61] [ICFHR-2018] Wang C, Yin F, Liu C L. Memory-Augmented Attention Model for Scene Text Recognition[C]//2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2018: 62-67. paper

[62] [ICCV-2019] M. Yang, Y. Guan, M. Liao, X. He, K. Bian, S. Bai, C. Yao, and X. Bai, “Symmetry-constrained rectification network for scene text recognition,” In Proceedings of International Conference on Computer Vision (ICCV), 2019, pp. 9147–9156. paper

[63] [ICCV-2019] Y. Sun, J. Liu, W. Liu, J. Han, E. Ding, and J. Liu, “Chinese street view text: Large-scale chinese text reading with partially supervised learning,” In Proceedings of International Conference on Computer Vision (ICCV), pp. 9086–9095. paper

[64] [ICME-2019] Wang S, Wang Y, Qin X, et al. Scene Text Recognition via Gated Cascade Attention[C]//2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019: 1018-1023. paper

[65] [ICCV-2019] Baek J, Kim G, Lee J, et al. What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of International Conference on Computer Vision (ICCV), 2019, pp: 4715-4723. paper code

[66] [NIPS-2017] Wang J, Hu X. Gated recurrent convolution neural network for OCR[C]//Advances in Neural Information Processing Systems. 2017: 335-344. paper code

[67] [NC-2019] Huang, Yunlong, et al. "EPAN: Effective parts attention network for scene text recognition." Neurocomputing (2019). paper

[68] [NC-2019] Gao, Yunze, et al. "Reading scene text with fully convolutional sequence modeling." Neurocomputing 339 (2019): 161-170. paper

[69] [ICDAR-W-2019] Qi, Xianbiao, et al. "A Novel Joint Character Categorization and Localization Approach for Character-Level Scene Text Recognition." 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 5. IEEE, 2019. paper

[70] [ICDAR-W-2019] Wang, Qingqing, et al. "ReELFA: A Scene Text Recognizer with Encoded Location and Focused Attention." 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 5. IEEE, 2019. paper

[71] [ICIP-2019] Zhu, Yiwei, et al. "Text Recognition in Images Based on Transformer with Hierarchical Attention." 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019. paper

[72] [CVPR-2019] Zhan, Fangneng, Hongyuan Zhu, and Shijian Lu. "Spatial fusion gan for image synthesis." In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2019. paper

[73] [ECCV-2018] Zhan, Fangneng, Shijian Lu, and Chuhui Xue. "Verisimilar image synthesis for accurate detection and recognition of texts in scenes."In Proceedings of the European Conference on Computer Vision (ECCV). 2018. paper code

[74] [MultiMedia-2018] Fang, Shancheng, et al. "Attention and language ensemble for scene text recognition with convolutional sequence modeling." 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018. paper code

[75] [NIPS-2018] Liu, Hu, Sheng Jin, and Changshui Zhang. "Connectionist temporal classification with maximum entropy regularization." Advances in Neural Information Processing Systems. 2018. paper code

[76] [PR-2017] Su, Bolan, and Shijian Lu. "Accurate recognition of words in scenes without character segmentation using recurrent neural network." Pattern Recognition 63 (2017): 397-405. paper

[77] [CVIU-2016] Mishra, Anand, Karteek Alahari, and C. V. Jawahar. "Enhancing energy minimization framework for scene text recognition with top-down cues." Computer Vision and Image Understanding 145 (2016): 30-42. paper

[78] [ICPR-2016] Liu, Xinhao, et al. "Scene text recognition with CNN classifier and WFST-based word labeling." In Proceedings of International Conference on Pattern Recognition (ICPR). IEEE, 2016. paper

[79] [TPAMI-2019] Liao M, Lyu P, He M, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[J]. IEEE transactions on pattern analysis and machine intelligence, 2019. paper code

[80] [AAAI-2020] T. Wang, Y. Zhu, L. Jin, C. Luo and X. Chen. Decoupled Attention Network for Text Recognition. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2020. paper


Newly added references (Feb 29, 2020)

[81] [ICDAR-2019] N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J.-C. Burie, C.-l. Liu et al., “ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition–RRC-MLT-2019,” In Proceeding of International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1582–1587. paper

[82] [arXiv-2019] W. Wang, E. Xie, P. Sun, W. Wang, L. Tian, C. Shen, and P. Luo, “TextSR: Content-aware text super-resolution guided by recognition,” CoRR abs/1909.07113, 2019. paper code (https://github.com/xieenze/TextSR)

[83] [AAAI-2020] Z. Wan, M. He, H. Chen, X. Bai, and C. Yao, “Textscanner: Reading characters in order for robust scene text recognition,” In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2020. paper

[84] [AAAI-2020] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin, “GTC: Guided training of ctc towards efficient and accurate scene text recognition,” In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2020. paper

[85] [arXiv-2020] C. Luo, Q. Lin, Y. Liu, L. Jin, and C. Shen, “Separating content from style using adversarial learning for recognizing text in the wild,” CoRR abs/2001.04189, 2020. paper

6. Help

If you find any problem with our resources, or know of any good paper/code we have missed, please let us know at xxuechen@foxmail.com. Thank you for your contribution.


7. Copyright

Copyright © 2019 SCUT-DLVC. All Rights Reserved.
