Controllable image caption with an encoder-decoder optimization structure

Authors: Jie Shao, Runxia Yang

Abstract

Controllable image captioning, which lies at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), is an important part of applying artificial intelligence to many real-life scenarios. We adopt an encoder-decoder structure in which a visual model serves as the encoder and a language model serves as the decoder. In this work, we introduce a new feature extraction model, FVC R-CNN, to learn both salient features and visual commonsense features. Furthermore, we propose a novel MT-LSTM neural network for sentence generation, which is activated by m-tanh and outperforms the traditional Long Short-Term Memory (LSTM) network by a significant margin. Finally, we put forward a multi-branch decision strategy to optimize the output. Experiments are conducted on the widely used COCO Entities dataset, and the results demonstrate that the proposed method outperforms the baseline and surpasses state-of-the-art methods across a wide range of evaluation metrics. In particular, it achieves CIDEr and SPICE scores of 206.3 and 47.6, respectively, yielding state-of-the-art (SOTA) performance.
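
As a rough illustration of the decoder side, the following is a minimal sketch of one step of a standard LSTM cell whose activation can be swapped. The abstract does not define m-tanh, so the `act` argument defaults to the ordinary tanh and serves only as a placeholder for it. The names `lstm_step` and `zero_params`, and the parameter layout, are hypothetical choices for this example and should not be read as the authors' MT-LSTM.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function used by the input, forget, and output gates."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute W @ v + b with plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def zero_params(input_size: int, hidden_size: int) -> Dict[str, list]:
    """Hypothetical helper: all-zero weights and biases for the four gates."""
    cols = input_size + hidden_size
    params: Dict[str, list] = {}
    for gate in ("i", "f", "o", "g"):
        params["W" + gate] = [[0.0] * cols for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params


def lstm_step(
    x: Vector,
    h_prev: Vector,
    c_prev: Vector,
    params: Dict[str, list],
    act: Callable[[float], float] = math.tanh,  # placeholder for m-tanh (assumption)
) -> Tuple[Vector, Vector]:
    """One step of a standard LSTM cell with a pluggable activation."""
    z = x + h_prev  # concatenate the input and the previous hidden state
    i = [sigmoid(a) for a in affine(params["Wi"], z, params["bi"])]  # input gate
    f = [sigmoid(a) for a in affine(params["Wf"], z, params["bf"])]  # forget gate
    o = [sigmoid(a) for a in affine(params["Wo"], z, params["bo"])]  # output gate
    g = [act(a) for a in affine(params["Wg"], z, params["bg"])]      # candidate state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * act(ck) for ok, ck in zip(o, c)]
    return h, c
```

Passing a different callable as `act` is the only change needed to try an alternative activation in this sketch; the gate equations themselves are those of the conventional LSTM.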

Keywords: Controllable image caption, M-tanh activation function, MT-LSTM neural network, FVC R-CNN model


Paper URL: https://doi.org/10.1007/s10489-021-02988-x