Image Captioning = a language model that generates text conditioned on an image
Language model = the common task of generating text with an RNN or LSTM
Question - Where should the image information be inserted into the language model?
Answer - Based on where the image information is fed into the pipeline, there are 2 architectures:
1. INJECT 2. MERGE
In INJECT, the image is fed into the RNN itself, and the way the image vector is combined with the text gives 3 possibilities: init-inject (the image initialises the RNN's hidden state), pre-inject (the image is treated as the first "word" of the caption), and par-inject (the image is combined with every word input).
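A minimal numpy sketch of the three inject variants, using a toy vanilla RNN. All sizes, weights, and the `rnn_step` helper are hypothetical stand-ins (the real models in the paper use an LSTM over learned embeddings and CNN features); the point is only where the image vector enters the recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # toy shared embedding/hidden size
Wx = rng.standard_normal((d, d)) * 0.1    # input-to-hidden weights
Wh = rng.standard_normal((d, d)) * 0.1    # hidden-to-hidden weights

def rnn_step(h, x):
    """One vanilla-RNN step: h' = tanh(Wx.x + Wh.h)."""
    return np.tanh(Wx @ x + Wh @ h)

image = rng.standard_normal(d)                       # stand-in CNN image feature
words = [rng.standard_normal(d) for _ in range(3)]   # stand-in word embeddings

# init-inject: the image vector initialises the hidden state.
h = image.copy()
for w in words:
    h = rnn_step(h, w)
init_inject_state = h

# pre-inject: the image is fed in as if it were the first word.
h = np.zeros(d)
for x in [image] + words:
    h = rnn_step(h, x)
pre_inject_state = h

# par-inject: the image is combined (here: added) with every word input.
h = np.zeros(d)
for w in words:
    h = rnn_step(h, w + image)
par_inject_state = h
```

In all three variants the RNN's hidden state has to carry both linguistic and visual information, which is what the analysis below is contrasting with MERGE.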
Analysis of the 2 architectures:
- INJECT:
  - tends to regenerate captions seen during training, with less variation in vocabulary
  - longer captions become more generic and less image-specific
  - requires the RNN's hidden state to memorise the image information, which implies more parameters and longer training time
- MERGE:
  - since the image features are kept separate from the text features until a late merge layer, it shows more variation and produces less generic captions
  - the RNN acts as a pure language encoder, so its hidden state can be kept smaller
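The MERGE architecture can be sketched the same way: the RNN encodes the text alone, and the image feature only meets the text representation outside the RNN, just before the next-word prediction. Again, all names and sizes (`Wout`, the softmax over a toy vocabulary) are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                     # toy embedding/hidden size
V = 6                                     # toy vocabulary size
Wx = rng.standard_normal((d, d)) * 0.1    # input-to-hidden weights
Wh = rng.standard_normal((d, d)) * 0.1    # hidden-to-hidden weights
Wout = rng.standard_normal((V, 2 * d)) * 0.1  # maps [text; image] to vocab scores

def rnn_step(h, x):
    """One vanilla-RNN step: h' = tanh(Wx.x + Wh.h)."""
    return np.tanh(Wx @ x + Wh @ h)

image = rng.standard_normal(d)                       # stand-in CNN image feature
words = [rng.standard_normal(d) for _ in range(3)]   # stand-in word embeddings

# The RNN never sees the image: it is a pure language encoder.
h = np.zeros(d)
for w in words:
    h = rnn_step(h, w)

# MERGE: concatenate the text state with the image feature outside the RNN,
# then score the vocabulary for the next word.
merged = np.concatenate([h, image])
logits = Wout @ merged
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the toy vocabulary
```

Because the hidden state carries only linguistic information here, it does not have to double as image memory, which is one way to read the memory/parameter advantage noted above.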
REFERENCES:
Tanti, M., Gatt, A. and Camilleri, K.P., 2017. Where to put the image in an image caption generator. arXiv preprint arXiv:1703.09137.