[Project] Image Captioning

Image Captioning = a language model that generates text conditioned on an image

Language model = the common task of generating text with an RNN or LSTM

Question - Where should the image information be inserted into the language model?

Answer - Depending on where the image information is fed into the pipeline, there are 2 architectures:

1. INJECT   2. MERGE

In INJECT, the way the text and image vectors are combined before being passed to the RNN gives 3 possibilities (following the terminology of Tanti et al., 2017):

  1. init-inject: the image vector is used as the RNN's initial hidden state
  2. pre-inject: the image vector is prepended to the word sequence as the first "token"
  3. par-inject: the image vector is combined (e.g. concatenated) with every word vector at each time step
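The three inject variants can be sketched at the level of array shapes. This is a minimal illustration, not an implementation: all dimensions are hypothetical, and the projection matrix stands in for a learned layer.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
seq_len, word_dim, img_dim, hidden_dim = 5, 300, 300, 512

words = np.random.rand(seq_len, word_dim)  # embedded caption tokens
image = np.random.rand(img_dim)            # CNN image feature vector

# init-inject: the image (projected to the hidden size) initializes the RNN state.
h0 = image @ np.random.rand(img_dim, hidden_dim)  # shape (hidden_dim,)

# pre-inject: the image is prepended to the word sequence as the first "token".
pre_inject_input = np.vstack([image[None, :], words])  # shape (seq_len + 1, word_dim)

# par-inject: the image is concatenated with every word vector at each step.
par_inject_input = np.hstack([words, np.tile(image, (seq_len, 1))])  # shape (seq_len, word_dim + img_dim)
```

In all three cases the image enters the RNN itself, which is what distinguishes INJECT from MERGE below.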
Analysis of the 2 architectures:
  1. INJECT:
    • tends to regenerate captions seen during training and shows less variation in its vocabulary
    • longer captions become more generic and less image-specific
    • requires the RNN cells to carry the image information through every step, which implies more parameters and more training time
  2. MERGE:
    • since image features are kept out of the RNN and only combined with the text features afterwards, it shows more vocabulary variation and produces less generic captions
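In MERGE, the RNN encodes the caption prefix alone, and the image vector is combined with the RNN output only at the end, before the next-word prediction. A minimal shape-level sketch (all dimensions and the random projection are hypothetical stand-ins for learned layers):

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
img_dim, hidden_dim, vocab_size = 300, 512, 1000

rnn_out = np.random.rand(hidden_dim)  # RNN summary of the caption prefix (text only)
image = np.random.rand(img_dim)       # CNN image feature vector

# merge: combine the two modalities outside the RNN, e.g. by concatenation,
# then project to the vocabulary for next-word prediction.
merged = np.concatenate([rnn_out, image])                 # shape (hidden_dim + img_dim,)
logits = merged @ np.random.rand(hidden_dim + img_dim, vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax over the vocabulary
```

Because the image never passes through the recurrent cells, the RNN's capacity is spent purely on modelling the text.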

REFERENCES:
Tanti, M., Gatt, A. and Camilleri, K.P., 2017. Where to put the image in an image caption generator. arXiv preprint arXiv:1703.09137.

