Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Authors: Hegui Zhu, Ru Wang, Xiangde Zhang

Abstract

In existing image captioning methods, masked convolution is commonly used to generate the language description, but the traditional residual network (ResNet) approach applied to masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet strategy of combining features through summation. The improved stacked attention module can capture fine-grained visual information highly relevant to word prediction. Finally, we apply a Transformer to the image encoder to fully obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
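The DFC idea described above (DenseNet-style dense connectivity followed by ResNet-style feature summation) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch only, not the authors' implementation: plain linear layers with ReLU stand in for the paper's masked convolutions, and the layer count and feature widths are arbitrary assumptions.

```python
import numpy as np

def dense_fusion_block(x, layer_weights):
    """Schematic dense fusion connection (DFC).

    Each stand-in layer receives the concatenation of all earlier
    feature maps (DenseNet-style dense connectivity); the block then
    folds the layer outputs back together by summation with the input
    (ResNet-style fusion). Shapes and weights are illustrative only.
    """
    features = [x]  # running list of the input plus every layer output
    for W in layer_weights:
        inp = np.concatenate(features, axis=-1)  # dense connectivity
        out = np.maximum(inp @ W, 0.0)           # linear map + ReLU
        features.append(out)
    # ResNet-style fusion: sum the input with all layer outputs
    return x + sum(features[1:])

# Toy usage: 3 layers, feature width 4; the i-th weight matrix maps the
# concatenation of (i + 1) feature maps back down to width 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
weights = [rng.standard_normal((4 * (i + 1), 4)) * 0.1 for i in range(3)]
y = dense_fusion_block(x, weights)
```

Because every layer is fed the concatenation of all earlier features, gradients have short paths back to the input, which is the property the abstract invokes against vanishing gradients; the final summation keeps the output the same width as the input, as in a residual block.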

Keywords: Image captioning, Masked convolution, Dense fusion connection, Improved stacked attention module


DOI: https://doi.org/10.1007/s11063-021-10431-y