Awesome Visual Grounding

A curated list of research papers in grounding. Link to the code if available is also present.

Have a look at SCOPE.md to get familiar with what grounding means and the tasks considered in this repository.

To maintaing the quality of the repo, I have gone through all the listed papers at least once before adding them to ensure their relevance to grounding. However, I might have missed some paper(s) or added some irrelevant paper(s). Feel free to open an issue in that case. I will go through the paper and then add / remove it.

Contributing
Demos
Other Compilations
Datasets
Paper Roadmap (Chronological Order):

Contributing

Feel free to contact me via email (ark.sadhu2904@gmail.com) or open an issue or submit a pull request. To add a new paper via pull request:

Fork the repo, change readme. Put the new paper under the correct heading, and place it at the correct chronological position.
Copy its reference in MLA format
Put ** around the title
Provide link to the paper (arxiv/semantic scholar/conference proceedings).
If code or website exists, link that too.
Send a pull request. Ideally, I will review the request within a week.

Demos

MATTNet demo: http://vision2.cs.unc.edu/refer/comprehension

Other Compilations:

Shoutout to some other awesome stuff on vision and language grounding:

Multi-modal Reading List by Paul Liang (@pliang279) : https://github.com/pliang279/awesome-multimodal-ml/
Temporal Grounding by Mu Ketong (@iworldtong): https://github.com/iworldtong/Awesome-Grounding-Natural-Language-in-Video
Temporal Grounding by WuJie (@WuJie1010): https://github.com/WuJie1010/Awesome-Temporally-Language-Grounding. Also, checkout their implementation of some of the popular papers: https://github.com/WuJie1010/Temporally-language-grounding

Datasets

Object-Sentence Grounding Datasets

RefClef: Kazemzadeh, Sahar, et al. Referitgame: Referring to objects in photographs of natural scenes. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. [Paper] [Website]
RefCOCO and RefCOCO+: 1. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]
RefCOCOg: Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Clevr-ref+: Liu, Runtao, et al. Clevr-ref+: Diagnosing visual reasoning with referring expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code] [Website]
Visual Genome: Krishna, Ranjay, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123.1 (2017): 32-73. [Paper] [Website]
Ref-Reasoning: Sibei Yang, Guanbin Li, Yizhou Yu. Graph-Structured Referring Expression Reasoning in The Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Website]

Object-phrase Grounding Datasets

Flickr30k: Plummer, Bryan A., et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE international conference on computer vision. 2015. [Paper] [Code] [Website]

Guess What Datasets

GuessWhat: De Vries, Harm, et al. Guesswhat?! visual object discovery through multi-modal dialogue. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. [Paper] [Code] [Website]

Video Grounding Datasets

TaCoS: Regneri, Michaela, et al. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013): 25-36. [Paper] [Website]
Charades: Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Website]
Charades-STA: Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017).[Paper] [Code]
Distinct Describable Moments (DiDeMo): Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code] [Website]
ActivityNet Captions: Krishna, Ranjay, et al. Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper] [Website]
Charades-Ego: [Website]
- Sigurdsson, Gunnar, et al. Actor and Observer: Joint Modeling of First and Third-Person Videos. CVPR-IEEE Conference on Computer Vision & Pattern Recognition. 2018. [Paper] [Code]
- Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018). [Paper] [Code]
TEMPO: Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]
ActivityNet-Entities: Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Embodied Agents Platforms:

Matterport3D: Chang, Angel, et al. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 (2017). [Paper] [Code] [Website]
- Photorealistic rooms
AI2-THOR: Kolve, Eric, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017). [Paper] [Website]
- Actionable objects!
Habitat AI: Savva, Manolis, et al. Habitat: A platform for embodied ai research. Proceedings of the IEEE International Conference on Computer Vision. 2019. (ICCV 2019) [Paper] [Website]
ALFRED: Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox, ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [Paper] [Code] [Project Page]

Visual Grounding / Referring Expressions (Images):

Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]
Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]
Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]
Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]
Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]
Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]
Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]
Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]
Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]
Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]
Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]
Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]
Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]
Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]
Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]
Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]
Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]
Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]
Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]
Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]
Li, Ruiyu, et al. Referring image segmentation via recurrent refinement networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper] [Code]
Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]
Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]
Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]
Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Shi, Hengcan, et al. Key-word-aware network for referring expression image segmentation. Proceedings of the European Conference on Computer Vision (ECCV). 2018.[Paper] [Code]
Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]
Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]
Liu, Daqing, et al. Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019. [Paper] [Code]
Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]
Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]
RETRACTED (see #2): Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]
Hong, Richang, et al. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). 2019. [Paper]
Liu, Xihui, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]
Dogan, Pelin, Leonid Sigal, and Markus Gross. Neural Sequential Phrase Grounding (SeqGROUND). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Fang, Zhiyuan, et al. Modularized textual grounding for counterfactual resilience. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Cross-Modal Relationship Inference for Grounding Referring Expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Dynamic Graph Attention for Referring Expression Comprehension. arXiv preprint arXiv:1909.08164 (2019). (ICCV 2019) [Paper] [Code]
Yang, Zhengyuan, et al. A Fast and Accurate One-Stage Approach to Visual Grounding. arXiv preprint arXiv:1908.06354 (2019). (ICCV 2019) [Paper] [Code]
Chen, Yi-Wen, et al. Referring Expression Object Segmentation with Caption-Aware Consistency. arXiv preprint arXiv:1910.04748 (2019). (BMVC 2019) [Paper] [Code]
Liu, Jiacheng, and Julia Hockenmaier. Phrase Grounding by Soft-Label Chain Conditional Random Field. arXiv preprint arXiv:1909.00301 (2019) (EMNLP 2019). [Paper] [Code]
Liu, Yongfei, Wan Bo, Zhu Xiaodan and He Xuming. Learning Cross-modal Context Graph for Visual Grounding. arXiv preprint arXiv: (2019) (AAAI-2020). [Paper] [Code]
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang, More Grounded Image Captioning by Distilling Image-Text Matching Model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Code]
Sibei Yang, Guanbin Li, Yizhou Yu, Graph-Structured Referring Expressions Reasoning in The Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Code]

Weakly Supervised Visual Grounding

Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]
Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]
Datta, Samyak, et al. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. arXiv preprint arXiv:1903.11649 (2019). (ICCV 2019) [Paper]
Wang, Josiah, and Lucia Specia. Phrase Localization Without Paired Training Examples. arXiv preprint arXiv:1908.07553 (2019). (ICCV 2019) [Paper] [Code]
Liu, Xuejing, et al. Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. arXiv preprint arXiv:1908.10568 (2019). (ICCV 2019) [Paper] [Code]

Zero-shot Visual Grounding

Sadhu, Arka, Kan Chen, and Ram Nevatia. Zero-Shot Grounding of Objects from Natural Language Queries. arXiv preprint arXiv:1908.07129 (2019).(ICCV 2019) [Paper] [Code]

Natural Language Object Retrieval (Images)

Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]
Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]
Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Grounding Relations / Referring Relations

Krishna, Ranjay, et al. Referring relationships. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code] [Website]
Raboh, Moshiko et al. Differentiable Scene Graphs. (2019). [Paper]
Conser, Erik, et al. Revisiting Visual Grounding. arXiv preprint arXiv:1904.02225 (2019). [Paper]
- Critique of Referring Relationship paper

Video Grounding using Natural Language:

Yu, Haonan, et al. Grounded Language Learning from Video Described with Sentences Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2013. [Paper]
Xu, Ran, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. [Paper]
Song, Young Chol, et al. Unsupervised Alignment of Actions in Video with Text Descriptions Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2016. [Paper]
Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017). Method name: TALL [Paper] [Code]
Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code]
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. Video Object Segmentation with Language Referring Expressions. arXiv preprint arXiv:1803.08006 (2018). [Paper] [Website]
Xu, Huijuan, et al. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250 (2018). [Paper] [Code]
Xu, Huijuan, et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning. arXiv preprint arXiv:1804.05113 (2018). [Paper] [Code]
Liu, Bingbin, et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. European Conference on Computer Vision. Springer, Cham, 2018. [Paper] [Website]
Liu, Meng, et al. Attentive Moment Retrieval in Videos. Proceedings of the International ACM SIGIR Conference . 2018. [Paper] [Website]
Chen, Jingyuan, et al. Temporally grounding natural sentence in video. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. [Paper]
Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]
Wu, Aming, and Yahong Han. Multi-modal Circulant Fusion for Video-to-Language and Backward. IJCAI. Vol. 3. No. 4. 2018. [Paper] [Code]
Ge, Runzhou, et al. MAC: Mining Actiivity Concepts for Language-based Temporal Localization. arXiv preprint arXiv:1811.08925 (2018). [Paper] [Code]
Zhang, Da, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. arXiv preprint arXiv:1812.00087 (2018). [Paper]
He, Dongliang, et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence. 2019. [Paper]
Wang, Weining, Yan Huang, and Liang Wang. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]
Ghosh, Soham, et al. ExCL: Extractive Clip Localization Using Natural Language Descriptions. arXiv preprint arXiv:1904.02755 (2019). [Paper]
Chen, Shaoxiang, and Yu-Gang Jiang. Semantic Proposal For Activity Localization In Videos Via Sentence Query. Proceedings of the AAAI Conference on Artificial Intelligence. 2019.[Paper]
Yuan Y, Mei T, Zhu W. To Find Where You Talk: Temporal Sentence Localization In Video With Attention Based Location Regression. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 9159-9166. [Paper]
Mithun N C, Paul S, Roy-Chowdhury A K. Weakly Supervised Video Moment Retrieval From Text Queries. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 11592-11601.[Paper]
Escorcia, Victor, et al. Temporal Localization of Moments in Video Collections with Natural Language. arXiv preprint arXiv:1907.12763 (2019). (ICCV 2019) [Paper] [Code]
Wang, Jingwen, Lin Ma, and Wenhao Jiang. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction. arXiv preprint arXiv:1909.05010 (2019). (AAAI 2020) [Paper] [Code]
Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan, Dense Regression Network for Video Grounding Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020 [Paper]
Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, et al. Embodied Language Grounding With 3D Visual Feature Representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper]
Arka Sadhu, Kan Chen, Ram Nevatia, Video Object Grounding Using Semantic Roles in Language Description Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Code]
Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao, Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Dataset]
Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman, Visual Grounding in Video for Unsupervised Word Translation Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2020. [Paper] [Code]

Grounded Description (Image) (WIP)

Hendricks, Lisa Anne, et al. Generating visual explanations. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Pytorch Code]
Jiang, Ming, et al. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. arXiv preprint arXiv:1909.02050 (2019). (EMNLP 2019) [Paper] [Code]
Lee, Jason, Kyunghyun Cho, and Douwe Kiela. Countering language drift via visual grounding. arXiv preprint arXiv:1909.04499 (2019). (EMNLP 2019) [Paper]

Grounded Description (Video) (WIP)

Ma, Chih-Yao, et al. Grounded Objects and Interactions for Video Captioning. arXiv preprint arXiv:1711.06354 (2017). [Paper]
Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Visual Grounding Pretraining

Sun, Chen, et al. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766 (2019). [Paper]
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv preprint arXiv:1908.02265 (Neurips 2019) [Paper] [Code]
Li, Liunian Harold, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557 (2019). [Paper] [Code]
Li, Gen, et al. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066 (2019). [Paper]
Tan, Hao, and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019). [Paper] [Code]
Su, Weijie, et al. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). [Paper]
Chen, Yen-Chun, et al. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740 (2019). [Paper]

Grounding for Embodied Agents (WIP):

Shridhar, Mohit, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. arXiv preprint arXiv:1912.01734 (2019). [Paper] [Code] [Website]

Misc:

Han, Xudong, Philip Schulz, and Trevor Cohn. Grounding learning of modifier dynamics: An application to color naming. arXiv preprint arXiv:1909.07586 (2019). (EMNLP 2019) [Paper] [Code]
Yu, Xintong, et al. What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues. arXiv preprint arXiv:1909.00421 (2019). (EMNLP 2019) [Paper] [Code]

bareblackfoot / awesome-grounding