jakezhaojb / neural-vqa

VIS+LSTM model for Visual Question Answering


This is an experimental Torch implementation of the VIS + LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros & Richard Zemel.

Model architecture



Download the MSCOCO train+val images and VQA data using sh data/download_data.sh. If you have already downloaded them, copy the train2014 and val2014 image folders and the VQA JSON files into the data folder.

Download the VGG-19 Caffe model and prototxt using sh models/download_models.sh.

Known issues

  • To avoid memory issues with LuaJIT, install Torch with vanilla Lua. More instructions here.
  • If working with plain Lua, luaffifb may be needed for loadcaffe, unless using pre-extracted fc7 features.


Extract image features

th extract_fc7.lua -split train
th extract_fc7.lua -split val


  • batch_size: Batch size. Default is 10.
  • split: train/val. Default is train.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
  • proto_file: Path to the deploy.prototxt file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers_deploy.prototxt.
  • model_file: Path to the .caffemodel file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers.caffemodel.
  • data_dir: Data directory. Default is data.
  • feat_layer: Layer to extract features from. Default is fc7.
  • input_image_dir: Image directory. Default is data.


Training

th train.lua


  • rnn_size: Size of LSTM internal state. Default is 1024.
  • embedding_size: Size of word embeddings. Default is 200.
  • learning_rate: Learning rate. Default is 1e-4.
  • learning_rate_decay: Learning rate decay factor. Default is 0.95.
  • learning_rate_decay_after: In number of epochs, when to start decaying the learning rate. Default is 10.
  • decay_rate: Decay rate for RMSProp. Default is 0.95.
  • batch_size: Batch size. Default is 64.
  • max_epochs: Number of full passes through the training data. Default is 50.
  • dropout: Dropout for regularization. Probability of dropping input. Default is 0.5.
  • init_from: Initialize network parameters from checkpoint at this path.
  • save_every: Number of iterations between checkpoints. Default is 1000.
  • train_fc7_file: Path to fc7 features of training set. Default is data/train_fc7.t7.
  • fc7_image_id_file: Path to fc7 image ids of training set. Default is data/train_fc7_image_id.t7.
  • val_fc7_file: Path to fc7 features of validation set. Default is data/val_fc7.t7.
  • val_fc7_image_id_file: Path to fc7 image ids of validation set. Default is data/val_fc7_image_id.t7.
  • data_dir: Data directory. Default is data.
  • checkpoint_dir: Checkpoint directory. Default is checkpoints.
  • savefile: Filename to save checkpoint to. Default is vqa.
  • gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
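
The way learning_rate, learning_rate_decay, and learning_rate_decay_after interact can be sketched as follows. This is a minimal Python illustration, not the repository's Lua code. It assumes the decay factor is applied once per epoch after the epoch count passes learning_rate_decay_after, which the option descriptions suggest but do not spell out. The function name learning_rate_at is hypothetical.

```python
def learning_rate_at(epoch, base_lr=1e-4, decay=0.95, decay_after=10):
    """Return the learning rate used during a given 1-indexed epoch.

    Assumption (not confirmed by the source): the rate stays at base_lr
    for the first `decay_after` epochs, then is multiplied by `decay`
    once for every further epoch.
    """
    steps = max(0, epoch - decay_after)  # number of decay steps applied so far
    return base_lr * decay ** steps


if __name__ == "__main__":
    # Print the schedule at a few epochs using the documented defaults.
    for epoch in (1, 10, 11, 20, 50):
        print(epoch, learning_rate_at(epoch))
```

Under these assumptions and the documented defaults, the rate stays at 1e-4 through epoch 10 and shrinks by 5% per epoch afterwards.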


Testing

th predict.lua -checkpoint_file checkpoints/lr1e-4b64_epoch17.25_0.5063.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'


  • checkpoint_file: Path to model checkpoint to initialize network parameters from.
  • input_image_path: Path to input image.
  • question: Question string.

Sample predictions

Randomly sampled image-question pairs from the VQA test set, and answers predicted by the VIS+LSTM model.

Q: What animals are those?
A: Sheep

Q: What color is the frisbee that's upside down?
A: Red

Q: What is flying in the sky?
A: Kite

Q: What color is court?
A: Blue

Q: What is in the standing person's hands?
A: Bat

Q: Are they riding horses both the same color?
A: No

Q: What shape is the plate?
A: Round

Q: Is the man wearing socks?
A: Yes

Q: What is over the woman's left shoulder?
A: Fork

Q: Where are the pink flowers?
A: On wall

Implementation Details

  • Last hidden layer image features from VGG-19
  • GloVe 200d word embeddings as question features
  • Zero-padded question sequences for batched implementation
  • Training questions are filtered for top_n answers, top_n = 1000 by default (~87% coverage)
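
The last two points can be illustrated with a short sketch. The Python below is not the repository's Lua code; the function names and the toy data are hypothetical. It assumes each training example is a (question tokens, answer) pair, that token id 0 is reserved for padding, and that sequences are padded on the right (the list above does not say which side is padded).

```python
from collections import Counter


def filter_top_answers(examples, top_n=1000):
    """Keep examples whose answer is among the top_n most frequent answers.

    Returns the kept examples and the fraction of examples retained (coverage).
    """
    counts = Counter(answer for _, answer in examples)
    keep = {answer for answer, _ in counts.most_common(top_n)}
    kept = [(q, a) for q, a in examples if a in keep]
    coverage = len(kept) / len(examples) if examples else 0.0
    return kept, coverage


def zero_pad(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length so they form one batch."""
    max_len = max((len(s) for s in sequences), default=0)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


# Toy data, invented for illustration only.
examples = [
    (["what", "color"], "red"),
    (["is", "it", "sunny"], "yes"),
    (["is", "he", "tall"], "yes"),
    (["what", "animal"], "sheep"),
]
kept, coverage = filter_top_answers(examples, top_n=1)
batch = zero_pad([[4, 7], [2, 9, 5], [1]])
```

With top_n=1 on the toy data, only the two "yes" examples survive, so coverage is 0.5; the ~87% figure above is the analogous ratio for top_n = 1000 on the real training questions.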

Pretrained model and data files

To reproduce results shown on this page or try your own image-question pairs, download the following and run predict.lua with the appropriate paths.

