A New Attention-Based LSTM for Image Captioning

Authors: Fen Xiao, Wenfeng Xue, Yanqing Shen, Xieping Gao

Abstract

Image captioning aims to describe the content of an image with a complete and natural sentence. Recently, image captioning methods with an encoder-decoder architecture have made great progress, in which the LSTM has become the dominant decoder for generating word sequences. However, in the decoding stage, the input vector remains the same at every time step and is largely uncorrelated with the previously attended visual parts or generated words. In this paper, we propose an attentional LSTM (ALSTM) and show how to integrate it into state-of-the-art automatic image captioning frameworks. Instead of the traditional LSTM used in existing models, ALSTM learns to refine the input vector from the network's hidden states and sequential context information. Thus ALSTM can attend to the most relevant features, such as spatial regions and visual relations, and pay more attention to the most relevant context words. Moreover, ALSTM is employed as the decoder in several classical frameworks, demonstrating how to obtain effective visual/context attention for updating the input vector. Extensive quantitative and qualitative evaluations of the modified networks on the Flickr30K and MSCOCO image datasets illustrate the superiority of ALSTM: ALSTM-based methods can generate high-quality descriptions by combining sequential context and relations.
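The abstract does not spell out the mechanism, but a minimal sketch of the core idea it describes, refining the decoder's input vector with attention computed from the previous hidden state over visual features, might look like the following PyTorch snippet. All names here (ALSTMCell, the dimensions, the additive attention form) are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALSTMCell(nn.Module):
    """Hypothetical sketch of an attention-refined LSTM decoder step:
    the raw input (word embedding) is fused with a visual context
    vector obtained by attending over spatial image features,
    conditioned on the previous hidden state."""

    def __init__(self, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project image regions
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # project h_{t-1}
        self.att_score = nn.Linear(hidden_dim, 1)          # scalar region scores
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)

    def forward(self, word_emb, feats, state):
        # word_emb: (batch, embed_dim); feats: (batch, num_regions, feat_dim)
        h, c = state
        # additive attention: score each region against the previous hidden state
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1)
        )).squeeze(-1)                                     # (batch, num_regions)
        alpha = F.softmax(scores, dim=-1)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1) # attended visual context
        # refine the input vector: fuse the word embedding with the context
        x = torch.cat([word_emb, context], dim=-1)
        return self.lstm(x, (h, c))
```

A full decoder would loop this cell over time steps, feeding back the embedding of each generated word; the paper's ALSTM additionally attends over previously generated words (sequential context), which this sketch omits for brevity.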

Keywords: Image caption, Attention, Long short-term memory, Deep learning

Paper link: https://doi.org/10.1007/s11063-022-10759-z