
Image-Caption-Generator

An image caption generator built with a CNN encoder and an LSTM decoder.

Dataset:

Flickr30k (~31.8k images)

Technologies Used:

  1. PyTorch
  2. Python
  3. spaCy
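
Since spaCy appears in the stack, here is a minimal sketch of how captions might be tokenized when building the vocabulary (an assumption; the README does not spell this out, and `en_core_web_sm` is a guess at the model used):

```python
import spacy

# Assumption: spaCy's English tokenizer splits captions into lowercase
# tokens for vocabulary building (not confirmed by this README).
nlp = spacy.load("en_core_web_sm")

def tokenize(caption):
    return [tok.text.lower() for tok in nlp.tokenizer(caption)]

print(tokenize("A dog runs across the grass."))
# ['a', 'dog', 'runs', 'across', 'the', 'grass', '.']
```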

Model Used:

1. CNN

  • Different pre-trained CNN backbones (ResNet-50, VGG, etc.) were used as the feature extractor; a minimal encoder sketch follows below.
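
A minimal sketch of such an encoder, assuming a ResNet-50 backbone with its classification head swapped for a linear projection (names like `EncoderCNN` and `embed_size` are illustrative, not the repository's code):

```python
import torch.nn as nn
from torchvision import models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Drop the final classification layer; keep the convolutional trunk.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # optionally freeze pre-trained weights
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images)      # (batch, 2048, 1, 1)
        return self.fc(feats.flatten(1))   # (batch, embed_size)
```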

2. LSTM

  • It acts as the decoder, turning the feature vector produced by the CNN module into the corresponding caption words.

Training:

  • The last layer of the CNN module is removed and a fully connected layer is added, producing a feature vector of a chosen size (e.g., 256). With batch_size=8, the output of the CNN module has shape (8, 256).
  • Each target word is mapped to a 256-dimensional embedding by the embedding layer. With max_length=40 and batch_size=8, the embedding layer output has shape (8, 40, 256).
  • The feature vector from the CNN module is concatenated with the embedding output along the time dimension, giving (8, 41, 256). This is fed to the LSTM, which produces a 256-dimensional hidden state and cell state at each step; a fully connected layer then maps each hidden state to the vocabulary (vocab_size ≈ 7500). Dropping the first time step, which consumes the image feature, leaves the 40 predictions that align with the 40 target words, i.e., shape (8, 40, 7500+).
  • This repeats at each time step, since the LSTM processes the sequence one word at a time.
  • Training is end-to-end; a sketch of the decoder is shown below.
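
A minimal sketch of the decoder described above, assuming the image feature is prepended to the word embeddings (names such as `DecoderLSTM`, `embed_size`, and `hidden_size` are illustrative, not the repository's code):

```python
import torch
import torch.nn as nn

class DecoderLSTM(nn.Module):
    def __init__(self, embed_size=256, hidden_size=256, vocab_size=7500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, 256) from the CNN; captions: (batch, 40) word ids.
        embeddings = self.embed(captions)                               # (batch, 40, 256)
        # Prepend the image feature as the first "word" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)  # (batch, 41, 256)
        hiddens, _ = self.lstm(inputs)                                  # (batch, 41, 256)
        return self.fc(hiddens)                                         # (batch, 41, vocab_size)

# Quick shape check with dummy data (batch_size=8, max_length=40):
logits = DecoderLSTM()(torch.randn(8, 256), torch.randint(0, 7500, (8, 40)))
print(logits.shape)  # torch.Size([8, 41, 7500]); drop step 0 to match the 40 targets
```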

Inference:

  • The image passes through the CNN module to produce a feature vector of size 256.

  • This feature vector is fed to the LSTM cell as its first input.

  • The LSTM output is projected to a probability distribution over the vocabulary.

  • Decoding loops for up to 40 steps (max_length), stopping early once the <end> token is produced.
    -- The embedding of the predicted word is then fed back into the LSTM cell as the next input (see the greedy-decoding sketch below).
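
A hedged sketch of this greedy decoding loop, reusing the `EncoderCNN` and `DecoderLSTM` sketches above and assuming a `vocab` object with an `itos` id-to-word lookup (none of these names come from the repository):

```python
import torch

@torch.no_grad()
def generate_caption(encoder, decoder, image, vocab, max_length=40):
    # image: a preprocessed (3, H, W) tensor; vocab.itos maps ids to words
    # (a hypothetical helper, not the repository's API).
    feature = encoder(image.unsqueeze(0))                # (1, 256)
    inputs, states, words = feature.unsqueeze(1), None, []
    for _ in range(max_length):
        hiddens, states = decoder.lstm(inputs, states)   # (1, 1, 256)
        logits = decoder.fc(hiddens.squeeze(1))          # (1, vocab_size)
        predicted = logits.argmax(dim=1)                 # greedy word choice
        word = vocab.itos[predicted.item()]
        if word == "<end>":
            break
        words.append(word)
        # Feed the predicted word's embedding back as the next input.
        inputs = decoder.embed(predicted).unsqueeze(1)   # (1, 1, 256)
    return " ".join(words)
```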

Result:

[sample captioned images shown in the original README]
