yongsongH / image-captioning

Image captioning models "show and tell" + "show, attend and tell" in PyTorch

image-captioning

Implementations of image captioning models in PyTorch, with support for several attention mechanisms. Currently only pretrained ResNet152 and VGG16 with batch normalization are provided as encoders.
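
For orientation, the encoder side boils down to wrapping a pretrained torchvision backbone and dropping its classifier. The sketch below illustrates the idea; the class name EncoderCNN is illustrative, not the repository's actual API.

import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    # Illustrative encoder: a pretrained ResNet152 with the final
    # classification layer removed, so it outputs image features.
    def __init__(self):
        super(EncoderCNN, self).__init__()
        resnet = models.resnet152(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, images):
        features = self.backbone(images)              # (batch, 2048, 1, 1)
        return features.view(features.size(0), -1)    # (batch, 2048)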

Models supported:

  • FC from "Show and Tell"
  • Att2all from "Show, Attend and Tell"
  • Att2in from "Self-critical Sequence Training for Image Captioning"
  • Spatial attention from "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
  • Adaptive attention from "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"

Captions are evaluated via capeval/, which is derived from tylin/coco-caption with minor changes for better Python 3 support.
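
capeval follows the usual coco-caption scorer interface; a rough usage sketch, assuming capeval keeps coco-caption's module layout, looks like this:

from capeval.bleu.bleu import Bleu      # module paths assumed to mirror tylin/coco-caption
from capeval.cider.cider import Cider

# Both scorers take dicts mapping an image id to a list of tokenized captions.
gts = {1: ["a dog runs across the grass", "a brown dog is running outside"]}  # references
res = {1: ["a dog running on the grass"]}                                     # generated caption

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1..4
cider_score, _ = Cider().compute_score(gts, res)
print(bleu_scores, cider_score)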

Requirements

  • The original MSCOCO dataset: put the images in one directory, e.g. COCO2014/, and set COCO_ROOT in configs.py accordingly (see the sketch after this list). The dataset can be downloaded from the official MSCOCO website.
  • Karpathy's split is used instead of a random split; please put it in COCO_PATH.
  • PyTorch v0.3.1 or newer with GPU support.
  • TensorBoardX
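
A minimal configs.py might then look like the sketch below; the paths and the split filename are placeholders, not the repository's exact values.

import os

# configs.py (sketch): point COCO_ROOT at the directory that holds the
# MSCOCO images and the preprocessed HDF5 databases.
COCO_ROOT = "/path/to/COCO2014"
# Karpathy's split file (filename assumed; use whatever you downloaded).
COCO_PATH = os.path.join(COCO_ROOT, "dataset_coco.json")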

Usage

1. Preprocessing

First, preprocess the images and store them locally. A phase can be specified if parallel processing is required. All preprocessed images are stored in HDF5 databases under COCO_ROOT.

python preprocess.py
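
Conceptually, the preprocessing step does something like the following sketch; the dataset layout and names below are assumptions for illustration, not the script's exact output.

import os
import h5py
import numpy as np
from PIL import Image

def build_hdf5(image_dir, out_path, size=(224, 224)):
    # Resize every image in image_dir and store it in one HDF5 database.
    names = sorted(os.listdir(image_dir))
    with h5py.File(out_path, "w") as db:
        images = db.create_dataset("images", (len(names), size[0], size[1], 3), dtype="uint8")
        for i, name in enumerate(names):
            img = Image.open(os.path.join(image_dir, name)).convert("RGB").resize(size)
            images[i] = np.asarray(img)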

2. Extract image features

Extract the image features offline with the encoder and store them locally. Currently only ResNet152 and VGG16 with batch normalization are supported.

python extract.py --pretrained=resnet --batch_size=10 --gpu=0
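
In rough terms, the extraction step runs a frozen encoder over the preprocessed images and saves the outputs. A sketch under that assumption (image_loader and the output filename are placeholders):

import torch
import torch.nn as nn
import torchvision.models as models

# Frozen ResNet152 without its classifier, used as a feature extractor.
resnet = models.resnet152(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()

features = []
with torch.no_grad():
    for images in image_loader:   # assumed to yield (batch, 3, 224, 224) float tensors
        features.append(encoder(images).view(images.size(0), -1).cpu())
torch.save(torch.cat(features), "resnet_features.pt")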

3. Training the model

Training can be performed only after the image features have been extracted. To train on the full dataset, set --train_size=-1. Immediate evaluation with beam search after training is also available; pass --evaluation=true. The scores are stored in scores/

python train.py --train_size=100 --val_size=10 --test=10 --epoch=30 --verbose=10 --learning_rate=1e-3 --batch_size=10 --gpu=0 --pretrained=resnet --attention=none --evaluation=true
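
The underlying training objective is standard teacher-forced cross-entropy over caption words. A minimal sketch, where decoder, loader, and the padding index are assumptions:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)            # assumes index 0 is the padding token
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for features, captions in loader:                          # captions: (batch, max_len) word indices
    logits = decoder(features, captions[:, :-1])           # predict the next word at every step
    loss = criterion(logits.reshape(-1, logits.size(-1)),  # (batch * steps, vocab)
                     captions[:, 1:].reshape(-1))          # shifted targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()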

4. Offline evaluation

After training is over, an offline evaluation can be performed. All generated captions are stored in results/

python evaluation.py --train_size=100 --test_size=10 --num=3 --batch_size=10 --gpu=10 --pretrained=resnet --attention=none --encoder=<path_to_encoder> --decoder=<path_to_decoder>

Note that train_size must match the number of images used for training.
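
The --encoder and --decoder arguments point at checkpoints saved during training; loading them back for evaluation looks roughly like this, assuming whole modules were saved rather than state dicts:

import torch

encoder = torch.load("<path_to_encoder>", map_location="cpu")
decoder = torch.load("<path_to_decoder>", map_location="cpu")
encoder.eval()   # disable dropout / batch norm updates for caption generation
decoder.eval()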

5. Visualize attention weights

This step is only available for the models with attention.

python show_attention.py --phase=test --pretrained=resnet --train_size=-1 --val_size=-1 --test_size=-1 --num=10 --encoder=<path_to_encoder> --decoder=<path_to_decoder> --gpu=0
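
The visualization itself amounts to upsampling a word's attention weights over the spatial feature grid and overlaying them on the image. A matplotlib sketch, assuming a 7x7 grid:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path, attn, word):
    # attn: one word's attention weights over the spatial feature grid,
    # assumed here to be 7x7; upsample to image size and overlay as a heatmap.
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    heat = Image.fromarray(attn.reshape(7, 7).astype(np.float32))
    heat = np.array(heat.resize((224, 224), Image.BILINEAR))
    plt.imshow(img)
    plt.imshow(heat, alpha=0.5, cmap="jet")
    plt.title(word)
    plt.axis("off")
    plt.show()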

Results

(Example images in the repository illustrate good, okay, and bad captions, as well as good and bad attention visualizations.)

Performance

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr |
| --- | --- | --- | --- | --- | --- |
| Baseline (Nearest neighbor) | 0.48 | 0.281 | 0.166 | 0.1 | 0.383 |
| FC | 0.720 | 0.536 | 0.388 | 0.286 | 0.805 |
| Att2in | 0.732 | 0.553 | 0.402 | 0.296 | 0.837 |
| Att2all | 0.732 | 0.554 | 0.403 | 0.296 | 0.838 |
| Spatial attention | 0.725 | 0.537 | 0.389 | 0.287 | 0.812 |
| Adaptive attention | 0.716 | 0.524 | 0.379 | 0.278 | 0.808 |
| NeuralTalk2 | 0.625 | 0.45 | 0.321 | 0.23 | 0.66 |
| Show and Tell | 0.666 | 0.461 | 0.329 | 0.27 | - |
| Show, Attend and Tell | 0.707 | 0.492 | 0.344 | 0.243 | - |
| Adaptive Attention | 0.742 | 0.580 | 0.439 | 0.266 | 1.085 |
| Neural Baby Talk | 0.755 | - | - | 0.347 | 1.072 |

Hyperparameters of the best models:

| Model | train_size | test_size | learning_rate | weight_decay | batch_size | beam_size | dropout |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FC | -1 | -1 | 2e-4 | 0 | 512 | 7 | 0 |
| Att2in | -1 | -1 | 5e-4 | 1e-4 | 256 | 7 | 0 |
| Att2all | -1 | -1 | 5e-4 | 1e-4 | 256 | 7 | 0 |
| Spatial attention | -1 | -1 | 2e-4 | 1e-4 | 256 | 7 | 0 |
| Adaptive attention | -1 | -1 | 2e-4 | 1e-4 | 256 | 7 | 0 |
