This code provides a TensorFlow implementation for Visual Dialog.
The code includes two evaluation protocols:
- round level: ground-truth answers for all previous rounds are provided as history when answering the current question.
- dialog level: ground-truth history answers are not available; the model must rely on its own earlier predictions.
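The difference between the two protocols can be illustrated with a small sketch. This is hypothetical code (the function and variable names are not from this repository) showing how the history available at round t would be assembled under each protocol:

```python
# Hypothetical illustration of the two evaluation protocols; not the
# repository's actual data pipeline.

def build_history(caption, questions, gt_answers, predicted_answers, t, protocol):
    """Return the dialog history available when answering question t (0-indexed)."""
    history = [caption]
    for i in range(t):
        if protocol == "round":
            # Round level: ground-truth answers for previous rounds.
            answer = gt_answers[i]
        else:
            # Dialog level: only the model's own earlier predictions.
            answer = predicted_answers[i]
        history.append(questions[i] + " " + answer)
    return history

caption = "a man riding a horse"
questions = ["what color is the horse", "is he wearing a hat"]
gt = ["brown", "yes"]
pred = ["black", "yes"]

# Round level sees the ground-truth answer "brown";
# dialog level sees the model's own prediction "black".
print(build_history(caption, questions, gt, pred, 1, "round"))
print(build_history(caption, questions, gt, pred, 1, "dialog"))
```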
Requires TensorFlow >= 1.4. Installation instructions are as follows:
pip install --user tensorflow-gpu
Download and unzip the VisDial v0.9 dataset:
wget https://computing.ece.vt.edu/~abhshkdz/data/visdial/visdial_0.9_train.zip
wget https://computing.ece.vt.edu/~abhshkdz/data/visdial/visdial_0.9_val.zip
unzip visdial_0.9_train.zip
unzip visdial_0.9_val.zip
Download the COCO dataset from http://cocodataset.org/#download.
Use the scripts under data/ to preprocess the data:
- prepro.py: preprocesses captions, questions, answers, and dialog information.
- resnet152_img.py: extracts ResNet-152 features. The ResNet-152 checkpoint must be downloaded first.
- vgg16_img.py: extracts VGG-16 features. The VGG-16 checkpoint must be downloaded first.
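For attention-based models, image features are typically kept as a grid of region vectors rather than a single pooled vector. The sketch below is illustrative only (the shapes and normalization are a common convention, not necessarily what these scripts emit): a VGG-16 conv5_3 map of shape (14, 14, 512) is flattened into 196 region vectors that an attention mechanism can weight individually.

```python
import numpy as np

# Illustrative shapes for spatial image features used with attention.
# The actual tensors produced by resnet152_img.py / vgg16_img.py may differ.

# e.g. VGG-16 conv5_3 output for one image: a 14 x 14 grid of 512-d vectors.
vgg_feat = np.random.rand(14, 14, 512).astype(np.float32)

# Flatten the grid into 196 region vectors so attention can weight
# image regions individually.
regions = vgg_feat.reshape(-1, 512)  # shape: (196, 512)

# L2-normalize each region vector, a common preprocessing step.
regions /= np.linalg.norm(regions, axis=1, keepdims=True)

print(regions.shape)
```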
The LF-G-round-level model reproduces the numbers reported in the Visual Dialog paper.
The AttProc-G-dialog-level model implements a recurrent attentive network for Visual Dialog. The network has two components: a dialog network, an LSTM that memorizes the temporal context of the dialog history, and an attentive processor, another LSTM integrated with an attention mechanism that parses the visual spatial context. The dialog network emits a question signal encoding both the current question and the dialog history, and passes it to the attentive processor. Guided by this signal, the attentive processor grounds the question in the image by iteratively glimpsing the visual content multiple times. Finally, a state vector incorporating the glimpses is passed to the decoder, which either generates an answer (generative model) or selects one from the candidates (discriminative model).
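The glimpse loop can be sketched as follows. This is a minimal NumPy sketch, not the repository's implementation: the dimensions, weight matrices, and the additive attention form are assumptions, and the state update stands in for the attentive processor's LSTM step.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions (not taken from the repository).
num_regions, feat_dim, state_dim, num_glimpses = 196, 512, 512, 2

image = rng.standard_normal((num_regions, feat_dim))  # spatial visual context
signal = rng.standard_normal(state_dim)               # question signal from the dialog LSTM

# Random projections standing in for the attentive processor's learned weights.
W_img = rng.standard_normal((feat_dim, state_dim)) * 0.01
W_sig = rng.standard_normal((state_dim, state_dim)) * 0.01
w_att = rng.standard_normal(state_dim) * 0.01

state = signal
for _ in range(num_glimpses):
    # Score each image region against the current state (additive attention).
    scores = np.tanh(image @ W_img + state @ W_sig) @ w_att  # (num_regions,)
    alpha = softmax(scores)                                  # attention weights
    glimpse = alpha @ image                                  # weighted sum of regions
    # Fold the glimpse back into the state; the real model would apply
    # an LSTM step here instead of this simple tanh update.
    state = np.tanh(glimpse @ W_img + state @ W_sig)

# 'state' is the final vector handed to the generative or discriminative decoder.
print(state.shape)
```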