Bi-LSTM

Partial re-implementation of Image Captioning with Deep Bidirectional LSTMs

Note: this is a work in progress. I will upload results with other datasets and a detailed explanation soon.

Introduction:

This model is similar to the Bi-LSTM model proposed in Image Captioning with Deep Bidirectional LSTMs, published at the 24th ACM International Conference on Multimedia [link]. An updated version of the paper was published in the journal ACM TOMM [link]. My implementation differs from the paper in the following ways:

  1. I have not used data augmentation in these experiments. However, the code includes options for horizontal and vertical data augmentation, which can be enabled by setting use_data_augmentation = True in train.py (a sketch of this option is given after this list).
  2. I have used a batch size of 32 and a learning rate of 0.0001 for all experiments.
  3. I have used the VGG-16 CNN for image feature extraction, whereas the authors experimented with both AlexNet and VGG-16.
  4. Since both the forward and backward LSTMs are trained for caption generation, I have experimented with both the inference strategy used in the paper (where the more likely of the sentences generated by the forward and backward LSTMs is used as the caption) and separate inference with the forward and backward LSTMs.
  5. The paper does not specify how the hidden and cell states are initialized, so I have initialized them as zero vectors for both the forward and backward LSTMs.
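
The sketch below illustrates how such a flag could work, assuming it simply toggles torchvision flip transforms in the image preprocessing pipeline; the exact transforms used in train.py may differ.

```python
# Hypothetical illustration of the use_data_augmentation toggle: when enabled,
# random horizontal and vertical flips are applied before the usual
# VGG-16 preprocessing. The actual pipeline in train.py may differ.
import torchvision.transforms as transforms

use_data_augmentation = True  # flag set in train.py

base_transforms = [
    transforms.Resize((224, 224)),                      # VGG-16 input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
]

if use_data_augmentation:
    augmentation = [transforms.RandomHorizontalFlip(p=0.5),
                    transforms.RandomVerticalFlip(p=0.5)]
    transform = transforms.Compose(augmentation + base_transforms)
else:
    transform = transforms.Compose(base_transforms)
```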

Method

I have used the two-layered bidirectional LSTM described in the paper. The Text-LSTM (T-LSTM) takes the word vector representations as input. The paper states that the Multimodal-LSTM (M-LSTM) uses both the image representation and the T-LSTM hidden state, but it is not clear to me how the two quantities are combined. So, in my implementation, the output of the T-LSTM and the image feature representation are concatenated and fed as input to the M-LSTM (see the sketch below). I have evaluated with beam sizes of 1, 3, 5, 10, 15 and 20, whereas the authors used greedy decoding with a beam size of 1.
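
The following is a minimal sketch of how one direction of this stack can be wired up in PyTorch: a T-LSTM over word embeddings whose output is concatenated with the image feature vector and fed to the M-LSTM, with zero-initialized hidden and cell states as noted above. The class name, dimensions and details are illustrative assumptions, not the exact code in this repository.

```python
# Minimal sketch of one direction of the two-layer stack: a Text-LSTM over
# word embeddings, whose output is concatenated with the image feature vector
# and fed to the Multimodal-LSTM. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class OneDirectionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, image_dim=4096):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.t_lstm = nn.LSTMCell(embed_dim, hidden_dim)                # T-LSTM
        self.m_lstm = nn.LSTMCell(hidden_dim + image_dim, hidden_dim)   # M-LSTM
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):
        batch_size, seq_len = captions.shape
        device = captions.device
        # Hidden and cell states initialized as zero vectors (point 5 above).
        h_t = torch.zeros(batch_size, self.t_lstm.hidden_size, device=device)
        c_t = torch.zeros_like(h_t)
        h_m = torch.zeros_like(h_t)
        c_m = torch.zeros_like(h_t)
        embeddings = self.embedding(captions)
        outputs = []
        for t in range(seq_len):
            h_t, c_t = self.t_lstm(embeddings[:, t], (h_t, c_t))
            # Concatenate the T-LSTM output with the image feature representation.
            h_m, c_m = self.m_lstm(torch.cat([h_t, image_features], dim=1), (h_m, c_m))
            outputs.append(self.fc(h_m))
        return torch.stack(outputs, dim=1)   # (batch, seq_len, vocab_size)
```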

In the paper, both the forward and backward LSTMs are trained to generate captions and their losses are combined. During evaluation, the captions generated by the forward and backward LSTMs are scored and the most likely caption is selected at each time-step. In my implementation, I save the captions generated by the forward and backward LSTMs separately and also record the caption generated by the overall model (i.e., the more likely of the forward and backward captions recorded at each time-step); a sketch of this selection is given below.
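
A minimal sketch of this selection step, assuming each direction returns the generated token indices together with its raw decoder scores (the function names and tensor shapes here are illustrative):

```python
# Hypothetical illustration of choosing the overall caption: the forward and
# backward hypotheses are compared by their summed log-probabilities and the
# more likely one is kept. The backward caption is reversed before reporting.
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, tokens):
    """Sum of per-step log-probabilities of `tokens` under `logits`.

    logits: (seq_len, vocab_size) raw decoder scores
    tokens: (seq_len,) generated word indices
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(1, tokens.unsqueeze(1)).sum().item()

def select_overall_caption(fwd_tokens, fwd_logits, bwd_tokens, bwd_logits):
    """Keep the more likely of the forward- and backward-generated captions."""
    if sequence_log_prob(fwd_logits, fwd_tokens) >= sequence_log_prob(bwd_logits, bwd_tokens):
        return fwd_tokens
    # The backward LSTM generates the sentence end-to-start, so reverse it.
    return torch.flip(bwd_tokens, dims=[0])
```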

Results

For the Flickr8k dataset:

The following table contains results obtained from the overall model (best captions selected from the forward and backward LSTMs):

| Model | Beam | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | SPICE | ROUGE-L |
|-------|------|--------|--------|--------|--------|--------|-------|-------|---------|
| Paper | 1 | 0.655 | 0.468 | 0.320 | 0.215 | -- | -- | -- | -- |
| Ours | 1 | 0.632 | 0.436 | 0.286 | 0.181 | 0.193 | 0.455 | 0.127 | 0.441 |
| Ours | 3 | 0.602 | 0.418 | 0.277 | 0.179 | 0.174 | 0.454 | 0.124 | 0.425 |
| Ours | 5 | 0.583 | 0.403 | 0.269 | 0.176 | 0.169 | 0.453 | 0.121 | 0.420 |
| Ours | 10 | 0.563 | 0.394 | 0.265 | 0.171 | 0.165 | 0.421 | 0.118 | 0.421 |
| Ours | 15 | 0.546 | 0.380 | 0.254 | 0.160 | 0.162 | 0.414 | 0.117 | 0.413 |
| Ours | 20 | 0.535 | 0.371 | 0.247 | 0.155 | 0.158 | 0.406 | 0.116 | 0.409 |

The following table contains results obtained using captions generated by the forward LSTM only:

| Beam | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | SPICE | ROUGE-L |
|------|--------|--------|--------|--------|--------|-------|-------|---------|
| 1 | 0.624 | 0.415 | 0.262 | 0.163 | 0.189 | 0.380 | 0.120 | 0.434 |
| 3 | 0.583 | 0.389 | 0.244 | 0.146 | 0.161 | 0.341 | 0.110 | 0.410 |
| 5 | 0.571 | 0.381 | 0.238 | 0.141 | 0.156 | 0.336 | 0.105 | 0.403 |
| 10 | 0.549 | 0.363 | 0.226 | 0.135 | 0.150 | 0.317 | 0.103 | 0.397 |
| 15 | 0.529 | 0.351 | 0.216 | 0.125 | 0.146 | 0.305 | 0.102 | 0.390 |
| 20 | 0.520 | 0.346 | 0.216 | 0.127 | 0.146 | 0.310 | 0.103 | 0.389 |

The following table contains results obtained using captions generated by the backward LSTM only:

| Beam | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | SPICE | ROUGE-L |
|------|--------|--------|--------|--------|--------|-------|-------|---------|
| 1 | 0.634 | 0.437 | 0.289 | 0.184 | 0.201 | 0.465 | 0.134 | 0.444 |
| 3 | 0.599 | 0.413 | 0.272 | 0.174 | 0.177 | 0.450 | 0.124 | 0.423 |
| 5 | 0.584 | 0.404 | 0.270 | 0.177 | 0.172 | 0.445 | 0.122 | 0.421 |
| 10 | 0.568 | 0.399 | 0.269 | 0.176 | 0.168 | 0.434 | 0.120 | 0.422 |
| 15 | 0.549 | 0.384 | 0.256 | 0.163 | 0.165 | 0.424 | 0.118 | 0.415 |
| 20 | 0.539 | 0.374 | 0.249 | 0.159 | 0.160 | 0.415 | 0.117 | 0.411 |

Reproducing the results:

  1. Download the 'Karpathy splits' for training, validation and testing from here.
  2. For evaluation, the code already computes BLEU scores. In addition, it saves the results and image annotations in the MSCOCO evaluation format, so the METEOR, CIDEr, ROUGE-L and SPICE metrics can be computed with the evaluation code that can be downloaded from here (the expected results format is sketched below).
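
For reference, the MSCOCO caption results format expected by that evaluation code is a JSON list of image_id/caption records; the sketch below writes such a file (the file name and image IDs here are illustrative):

```python
# Hypothetical illustration of saving generated captions in the MSCOCO
# results format: a JSON list of {"image_id", "caption"} records.
import json

generated = {42: "a dog runs through the grass",      # image_id -> caption
             43: "two children play on the beach"}

results = [{"image_id": image_id, "caption": caption}
           for image_id, caption in generated.items()]

with open("captions_results.json", "w") as f:
    json.dump(results, f)
```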

Prerequisites:

  1. This code has been tested on Python 3.6.9 but should work on all Python versions > 3.6.
  2. Pytorch v1.5.0
  3. CUDA v10.1
  4. Torchvision v0.6.0
  5. NumPy v1.15.0
  6. pretrainedmodels v0.7.4 (install from source). (I think other versions will also work, but I have listed these for the sake of completeness.)

Execution:

  1. First, set the paths to the Flickr8k/Flickr30k/MSCOCO data folders in the create_input_files_dataname.py file (with 'dataname' replaced by f8k/f30k/coco).
  2. Create the processed dataset by running:

python create_input_files_dataname.py

  3. To train the model:

python train_dataname.py

  4. To evaluate:

python eval_dataname.py beamsize

(e.g., python eval_f8k.py 20)
