Dense semantic embedding network for image captioning

Highlights:

• A Densely Semantic Embedding Network (DSEN) is constructed, which embeds the attributes into the DSE-LSTM together with each of the other inputs at every step of word generation.

• The representations of the inputs (image feature, text feature, and hidden state) are enhanced through modulation by the attributes.

• An activation function is proposed to compose the attributes. It is designed as a Threshold ReLU (TReLU), which modulates the attributes to be sparser while retaining enough discriminative power.

• Comprehensive evaluations demonstrate the effectiveness of our method on both image captioning and image-text cross-modal retrieval tasks.
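
The highlights do not give the exact form of the TReLU, so the following is only a minimal sketch of how a thresholded ReLU can sparsify attribute scores. It assumes the common definition in which a value passes through unchanged when it exceeds a threshold and is set to zero otherwise. The function name `trelu`, the threshold value 0.5, and the sample scores are all hypothetical and are not taken from the paper.

```python
from typing import List


def trelu(values: List[float], threshold: float = 0.5) -> List[float]:
    """Assumed thresholded ReLU: keep a value only if it exceeds the threshold.

    This is an illustrative definition, not necessarily the paper's TReLU.
    """
    return [v if v > threshold else 0.0 for v in values]


# Hypothetical attribute confidence scores for one image.
attribute_scores = [0.05, 0.92, 0.31, 0.77, 0.10]

# Low-confidence attributes are zeroed out; confident ones are kept as-is.
sparse_scores = trelu(attribute_scores, threshold=0.5)
print(sparse_scores)
```

Under this assumed definition, a higher threshold yields a sparser attribute vector, at the cost of discarding weaker but possibly relevant attributes.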

Article history: Received 21 April 2018, Revised 2 December 2018, Accepted 24 January 2019, Available online 31 January 2019, Version of Record 5 February 2019.