Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning

Authors:

Highlights:

Abstract

Existing image paragraph captioning methods generate long paragraph captions solely from input images, relying on insufficient information. In this paper, we propose RAMP, a retrieval-enhanced adversarial training framework with dynamic memory-augmented attention for image paragraph captioning, which makes full use of the R-best retrieved candidate captions to enhance image paragraph captioning via adversarial training. Concretely, RAMP treats the retrieved captions as reference captions to augment the discriminator during adversarial training, encouraging the image captioning model (the generator) to incorporate informative content from the retrieved captions into the generated caption. In addition, a retrieval-enhanced dynamic memory-augmented attention network is devised to keep track of the coverage information and attention history along with the update chain of the decoder state, thereby avoiding repetitive or incomplete image descriptions. Finally, a copying mechanism is applied to select words from the retrieved candidate captions and place them in the proper positions of the target caption, improving the fluency and informativeness of the generated caption. Extensive experiments on a benchmark dataset (i.e., Stanford) demonstrate that the proposed RAMP model significantly outperforms state-of-the-art methods across multiple evaluation metrics. For reproducibility, we release the code and data at https://github.com/anonymous-caption/RAMP.
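The copying mechanism described above can be sketched as a pointer-style mixture: the final word distribution blends the decoder's vocabulary distribution with an attention distribution over tokens of the retrieved candidate captions. The function, variable names, and mixing weight below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def copy_mix(p_vocab, attn_over_retrieved, retrieved_ids, p_gen):
    """Mix generation and copying distributions (hypothetical sketch).

    p_vocab:             (vocab_size,) softmax over the decoder vocabulary
    attn_over_retrieved: (n_tokens,) attention weights over retrieved-caption tokens
    retrieved_ids:       (n_tokens,) vocabulary ids of those tokens
    p_gen:               scalar in [0, 1], probability of generating vs. copying
    """
    p_final = p_gen * p_vocab
    # Scatter-add the copy probabilities onto the vocabulary axis;
    # np.add.at accumulates correctly even when ids repeat.
    np.add.at(p_final, retrieved_ids, (1.0 - p_gen) * attn_over_retrieved)
    return p_final

vocab_size = 10
p_vocab = np.full(vocab_size, 1.0 / vocab_size)   # uniform for the toy example
attn = np.array([0.7, 0.3])                        # attention over two retrieved tokens
retrieved_ids = np.array([3, 7])                   # their vocabulary ids
p = copy_mix(p_vocab, attn, retrieved_ids, p_gen=0.6)
```

Because both input distributions sum to one, the mixture remains a valid probability distribution, and words appearing in the retrieved captions (ids 3 and 7 here) receive extra mass proportional to their attention weights.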

Keywords: Image paragraph captioning, Key–value memory network, Adversarial training

Article history: Received 26 July 2020, Revised 30 October 2020, Accepted 17 December 2020, Available online 30 December 2020, Version of Record 9 January 2021.

DOI: https://doi.org/10.1016/j.knosys.2020.106730