Vision Language Warehouse

Bridging visual modalities and natural language is an interesting yet challenging task. It is attracting growing research attention and requires interdisciplinary effort from Computer Vision, Natural Language Processing and Machine Learning.

This repository contains recent papers, projects and materials on Image Captioning, Text-Image Matching and Text-to-Image Generation.

Image Captioning

Template-based methods

VIsual TRAnslator: Linking perceptions and natural language descriptions PDF

Learning visually grounded words and syntax for a scene description task PDF

Every picture tells a story: Generating sentences from images PDF

Babytalk: Understanding and generating simple image descriptions PDF

Deep-learning-based approaches

Show and Tell: A Neural Image Caption Generator (CVPR2015) PDF

Deep Visual-Semantic Alignments for Generating Image Descriptions (CVPR2015) PDF code site

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML2015) PDF code site

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks (NIPS2015) PDF

Areas of Attention for Image Captioning (ICCV2017) PDF

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning (CVPR2017) PDF code

SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning (CVPR2017) PDF code

Self-critical Sequence Training for Image Captioning (CVPR2017) PDF

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning (AAAI2018) PDF code

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (CVPR2018) PDF code

Convolutional Image Captioning (CVPR2018) PDF code

Rethinking the Form of Latent States in Image Captioning (ECCV2018) PDF code

Recurrent Fusion Network for Image Captioning (ECCV2018) PDF

Materials

GitHub repositories

pytorch-tutorial/image_captioning

ruotianluo/ImageCaptioning.pytorch

tylin/coco-caption

alecwangcq/show-attend-and-tell

sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning

daveredrum/image-captioning

Docs

Deep Visual-Semantic Alignments for Generating Image Descriptions

Automated Image Captioning

Caption this, with TensorFlow

Soft & hard attention

Text-Image Matching

Cross-modal Retrieval with Correspondence Autoencoder (ACMMM2014) PDF

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models (arXiv 2014) PDF

Multimodal Convolutional Neural Networks for Matching Image and Sentence (ICCV2015) PDF

Identity-Aware Textual-Visual Matching with Latent Co-attention (ICCV2017) PDF

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM (CVPR2017) PDF

Deep Cross-Modal Projection Learning for Image-Text Matching (ECCV2018) PDF

End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss (IJMIR2018) PDF

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR2018) PDF

Text-to-Image Generation

Generating Images From Captions with Attention (ICLR2016) PDF code

Learning What and Where to Draw (NIPS2016) PDF code

Generative Adversarial Text to Image Synthesis (ICML2016) PDF code

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (ICCV2017) PDF code

ChatPainter: Improving Text to Image Generation using Dialogue (arXiv 2018) PDF

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (CVPR2018) PDF code

Text2Scene: Generating Abstract Scenes from Textual Descriptions (arXiv 2018) PDF
