Image Captioning = a language model that generates text conditioned on an image
Language model = the common task of generating text with an RNN or LSTM
Question - Where should the image information be inserted into the language model?
Answer - Based on where the image information is fed into the pipeline, there are 2 architectures:
1. INJECT 2. MERGE
In INJECT, the image is fed into the RNN itself, and the way the image vector is combined with the text gives 3 possibilities: init-inject (the image initialises the RNN's hidden state), pre-inject (the image is treated as the first "word" of the caption), and par-inject (the image is combined with every word input).
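A minimal numpy sketch of the three inject variants, using a toy vanilla RNN. All sizes, weights, and the `rnn_step` helper are hypothetical stand-ins (the real models in the paper use an LSTM over learned embeddings and CNN features); the point is only where the image vector enters the recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # toy shared embedding/hidden size
Wx = rng.standard_normal((d, d)) * 0.1    # input-to-hidden weights
Wh = rng.standard_normal((d, d)) * 0.1    # hidden-to-hidden weights

def rnn_step(h, x):
    """One vanilla-RNN step: h' = tanh(Wx.x + Wh.h)."""
    return np.tanh(Wx @ x + Wh @ h)

image = rng.standard_normal(d)                       # stand-in CNN image feature
words = [rng.standard_normal(d) for _ in range(3)]   # stand-in word embeddings

# init-inject: the image vector initialises the hidden state.
h = image.copy()
for w in words:
    h = rnn_step(h, w)
init_inject_state = h

# pre-inject: the image is fed in as if it were the first word.
h = np.zeros(d)
for x in [image] + words:
    h = rnn_step(h, x)
pre_inject_state = h

# par-inject: the image is combined (here: added) with every word input.
h = np.zeros(d)
for w in words:
    h = rnn_step(h, w + image)
par_inject_state = h
```

In all three variants the RNN's hidden state has to carry both linguistic and visual information, which is what the analysis below is contrasting with MERGE.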
Analysis of the 2 architectures:
- INJECT:
  - tends to regenerate captions seen during training, with less variation in vocabulary
  - longer captions become more generic and less image-specific
  - requires the RNN's hidden state to memorise the image information, which implies more parameters and longer training time
- MERGE:
  - since the image features are kept separate from the text features until a late merge layer, it shows more variation and produces less generic captions
  - the RNN acts as a pure language encoder, so its hidden state can be kept smaller
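The MERGE architecture can be sketched the same way: the RNN encodes the text alone, and the image feature only meets the text representation outside the RNN, just before the next-word prediction. Again, all names and sizes (`Wout`, the softmax over a toy vocabulary) are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                     # toy embedding/hidden size
V = 6                                     # toy vocabulary size
Wx = rng.standard_normal((d, d)) * 0.1    # input-to-hidden weights
Wh = rng.standard_normal((d, d)) * 0.1    # hidden-to-hidden weights
Wout = rng.standard_normal((V, 2 * d)) * 0.1  # maps [text; image] to vocab scores

def rnn_step(h, x):
    """One vanilla-RNN step: h' = tanh(Wx.x + Wh.h)."""
    return np.tanh(Wx @ x + Wh @ h)

image = rng.standard_normal(d)                       # stand-in CNN image feature
words = [rng.standard_normal(d) for _ in range(3)]   # stand-in word embeddings

# The RNN never sees the image: it is a pure language encoder.
h = np.zeros(d)
for w in words:
    h = rnn_step(h, w)

# MERGE: concatenate the text state with the image feature outside the RNN,
# then score the vocabulary for the next word.
merged = np.concatenate([h, image])
logits = Wout @ merged
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the toy vocabulary
```

Because the hidden state carries only linguistic information here, it does not have to double as image memory, which is one way to read the memory/parameter advantage noted above.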
REFERENCES:
Tanti, M., Gatt, A. and Camilleri, K.P., 2017. Where to put the image in an image caption generator. arXiv preprint arXiv:1703.09137.