EvArEST

Everyday Arabic-English Scene Text dataset, from the paper: Arabic Scene Text Recognition in the Deep Learning Era: Analysis on A Novel Dataset

Detection Dataset

The text detection dataset has 510 images, each containing one or more instances of text. Each word is annotated with a four-point polygon whose points start at the top-left corner and proceed clockwise. Each image comes with a text file containing three attributes: the four points of the polygon that contains the word, the language of the word.
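As a rough illustration of reading one annotation entry, the sketch below assumes each line of the text file is comma-separated, with the eight polygon coordinates first and the language next. That layout, the `WordBox` type, and the function names are assumptions made for this example and are not confirmed by the dataset description; adjust the separator and field order to match the actual files. Any fields after the language are kept verbatim rather than interpreted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass
class WordBox:
    polygon: List[Point]  # four (x, y) points, top-left first, clockwise
    language: str         # language label as written in the file
    extra: List[str]      # any remaining fields, kept uninterpreted


def parse_annotation_line(line: str, sep: str = ",") -> WordBox:
    """Parse one line under the ASSUMED layout x1,y1,...,x4,y4,language[,...]."""
    fields = [f.strip() for f in line.strip().split(sep)]
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    coords = [int(float(v)) for v in fields[:8]]
    polygon = list(zip(coords[0::2], coords[1::2]))
    return WordBox(polygon=polygon, language=fields[8], extra=fields[9:])


def is_clockwise(polygon: List[Point]) -> bool:
    """Shoelace test in image coordinates (y grows downward).

    With y pointing down, a positive shoelace sum means the points
    run clockwise as seen on screen.
    """
    total = 0
    for i, (x1, y1) in enumerate(polygon):
        x2, y2 = polygon[(i + 1) % len(polygon)]
        total += x1 * y2 - x2 * y1
    return total > 0
```

The clockwise check is only a sanity test for the ordering described above; it does not verify that the first point is the top-left corner.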

Training Data

Test Data

Recognition Dataset

The text recognition dataset comprises 7232 cropped word images in both Arabic and English. The ground truth for the recognition dataset is provided as a text file in which each line contains the image file name and the text in the image. The dataset can be used either for Arabic-only text recognition or for bilingual text recognition.
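A minimal sketch of loading the recognition ground truth follows. It assumes the file name and the text on each line are separated by whitespace, which the description above does not specify; pass a different `sep` if the files use another delimiter. The function names are hypothetical, and the Arabic filter is a simple heuristic based on the basic Arabic Unicode block (U+0600 to U+06FF), not part of the dataset tooling.

```python
from typing import Iterable, List, Optional, Tuple

Label = Tuple[str, str]  # (image file name, text in the image)


def parse_labels(lines: Iterable[str], sep: Optional[str] = None) -> List[Label]:
    """Split each non-empty line into (file name, text).

    sep=None splits on the first run of whitespace (an ASSUMPTION about
    the file layout); the text part is kept whole.
    """
    pairs: List[Label] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(sep, 1)
        if len(parts) != 2:
            raise ValueError(f"cannot split into name and text: {raw!r}")
        pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


def has_arabic(text: str) -> bool:
    """True if any character falls in the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def arabic_only(pairs: List[Label]) -> List[Label]:
    """Keep only samples whose text contains Arabic characters."""
    return [(name, text) for name, text in pairs if has_arabic(text)]
```

For bilingual recognition, the full list returned by `parse_labels` would be used as-is; for Arabic-only recognition, `arabic_only` narrows it down.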

Training Data:

Arabic - English

Test Data:

Arabic - English

Synthetic Data

About 200k synthetic images with segmentation maps.

SynthData

Other Resources for Arabic Data

ICDAR 2019 Robust Reading Challenge on Multi-lingual scene text detection and recognition

https://rrc.cvc.uab.es/?ch=15&com=tasks

Synthetic MLT Data

https://github.com/MichalBusta/E2E-MLT

Citation

If you find this dataset useful for your research, please cite:

@article{hassan2021arabic,
  title={Arabic Scene Text Recognition in the Deep Learning Era: Analysis on A Novel Dataset},
  author={Hassan, Heba and El-Mahdy, Ahmed and Hussein, Mohamed E},
  journal={IEEE Access},
  year={2021},
  publisher={IEEE}
}