Awesome Text VQA

Text-related VQA is a fine-grained direction of the VQA task that focuses only on questions that require reading the textual content shown in the input image.

Datasets

VisualMRC dataset (AAAI 2021) [Project][Paper]
EST-VQA dataset (CVPR 2020) [Project][Paper]
DOC-VQA dataset (CVPR Workshop 2020) [Project][Paper]
Text-VQA dataset (CVPR 2019) [Project][Paper]
ST-VQA dataset (ICCV 2019) [Project][Paper]
OCR-VQA dataset (ICDAR 2019) [Project][Paper]

Dataset	#Train+Val Img	#Train+Val Que	#Test Img	#Test Que	Image Source	Language
Text-VQA	25,119	39,602	3,353	5,734	[1]	EN
ST-VQA	19,027	26,308	2,993	4,163	[2, 3, 4, 5, 6, 7, 8]	EN
OCR-VQA	186,775	901,717	20,797	100,429	[9]	EN
EST-VQA	17,047	19,362	4,000	4,525	[4, 5, 8, 10, 11, 12, 13]	EN+CH
DOC-VQA	11,480	44,812	1,287	5,188	[14]	EN
VisualMRC	7,960	23,854	2,237	6,708	self-collected webpage screenshot	EN

Image Source:
[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]
[2] Imagenet: A large-scale hierarchical image database [dataset]
[3] Vizwiz grand challenge: Answering visual questions from blind people [dataset]
[4] ICDAR 2013 robust reading competition [dataset]
[5] ICDAR 2015 competition on robust reading [dataset]
[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]
[7] Image retrieval using textual cues [dataset]
[8] Coco-text: Dataset and benchmark for text detection and recognition in natural images [dataset]
[9] Judging a book by its cover [dataset]
[10] Total Text [dataset]
[11] SCUT-CTW1500 [dataset]
[12] MLT [dataset]
[13] Chinese Street View Text [dataset]
[14] UCSF Industry Document Library [dataset]
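The dataset table above gives image and question counts separately, so the annotation density of each dataset is not immediately visible. The following is a minimal sketch that computes the average number of questions per image over all splits. The counts are copied from the table; the names `DATASETS` and `questions_per_image` are hypothetical helpers introduced only for this illustration.

```python
# Counts copied from the dataset statistics table above, in the order:
# (train+val images, train+val questions, test images, test questions)
DATASETS = {
    "Text-VQA": (25_119, 39_602, 3_353, 5_734),
    "ST-VQA": (19_027, 26_308, 2_993, 4_163),
    "OCR-VQA": (186_775, 901_717, 20_797, 100_429),
    "EST-VQA": (17_047, 19_362, 4_000, 4_525),
    "DOC-VQA": (11_480, 44_812, 1_287, 5_188),
    "VisualMRC": (7_960, 23_854, 2_237, 6_708),
}


def questions_per_image(name):
    """Average questions per image over all splits (hypothetical helper)."""
    tv_img, tv_que, test_img, test_que = DATASETS[name]
    return (tv_que + test_que) / (tv_img + test_img)


if __name__ == "__main__":
    # List datasets from most to least densely annotated.
    for name in sorted(DATASETS, key=questions_per_image, reverse=True):
        print(f"{name:<10} {questions_per_image(name):.2f} questions/image")
```

By this measure OCR-VQA is the most densely annotated dataset (roughly 4.8 questions per image) and EST-VQA the least (roughly 1.1).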

Related Challenges

ICDAR 2021 Competition on Document Visual Question Answering (DocVQA) Submission Deadline: 31st March 2021 [Challenge]
Document Visual Question Answering (CVPR 2020 Workshop on Text and Documents in the Deep Learning Era) Submission Deadline: ~~30 April 2020~~ [Challenge]

Papers

2021

[VisualMRC] VisualMRC: Machine Reading Comprehension on Document Images (AAAI) [Paper][Project]
[SSBaseline] Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps (AAAI) [Paper][Code]

2020

[SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper][Project][Code]
[EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
[M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
[LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]
[CRN] Cascade Reasoning Network for Text-based Visual Question Answering (ACM MM) [Paper][Project]

2019

[Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
[ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
[Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
[OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]

Technical Reports

[TAP] TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [Report]
[RUArt] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [Report]
[SMA] Structured Multimodal Attentions for TextVQA [Report][Slides][Video]
[DiagNet] DiagNet: Bridging Text and Image [Report][Code]
[DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
[Schwail] Runner-up of 2019 Text-VQA challenge [Slides]

Benchmark

Acc.: Accuracy; I. E.: Image Encoder; Q. E.: Question Encoder; O. E.: OCR Token Encoder; Ensem.: Ensemble

Text-VQA

[official leaderboard(2019)] [official leaderboard(2020)]

Y-C./J.	Methods	Acc.	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2019--CVPR	LoRRA	26.64	Faster R-CNN	GloVe	Rosetta-ml	FastText	Classification	N
2019--N/A	DCD_ZJU	31.44	Faster R-CNN	BERT	Rosetta-ml	FastText	Classification	Y
2020--CVPR	M4C	40.46	Faster R-CNN (ResNet-101)	BERT	Rosetta-en	FastText	Decoder	N
2020--Challenge	Xiangpeng	40.77
2020--Challenge	colab_buaa	44.73
2020--Challenge	CVMLP(SAM)	44.80
2020--Challenge	NWPU_Adelaide_Team(SMA)	45.51	Faster R-CNN	BERT	BDN	Graph Attention	Decoder	N
2020--ECCV	SA-M4C	44.6*	Faster R-CNN (ResNeXt-152)	BERT	Google-OCR	FastText+PHOC	Decoder	N
2020--arXiv	TAP	53.97*	Faster R-CNN (ResNeXt-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N

* Using external data for training.

ST-VQA

[official leaderboard]
T1: Strongly Contextualised Task; T2: Weakly Contextualised Task; T3: Open Dictionary

Y-C./J.	Methods	Acc. (T1/T2/T3)	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2020--CVPR	M4C	na/na/0.4621	Faster R-CNN (ResNet-101)	BERT	Rosetta-en	FastText	Decoder	N
2020--Challenge	SMA	0.5081/0.3104/0.4659	Faster R-CNN	BERT	BDN	Graph Attention	Decoder	N
2020--ECCV	SA-M4C	na/na/0.5042	Faster R-CNN (ResNeXt-152)	BERT	Google-OCR	FastText+PHOC	Decoder	N
2020--arXiv	TAP	na/na/0.5967	Faster R-CNN (ResNeXt-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N

OCR-VQA

Y-C./J.	Methods	Acc.	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2020--CVPR	M4C	63.9	Faster R-CNN	BERT	Rosetta-en	FastText	Decoder	N
