Image caption model of double LSTM with scene factors

Authors:

Highlights:

Abstract

In this paper, an image semantic understanding model that incorporates scene factors is proposed to address the low accuracy of description sentences produced by current image semantic understanding models, which either misidentify or ignore scene information. The model first applies LDA topic analysis to the text of the corpus to identify the corresponding topics (scene information) and to obtain the vocabulary used in each scene. It then uses ResNet to extract global image features and Places365-CNNs to extract deep scene features. Finally, by combining the image scene information with the corpus scene information, the model assigns higher probability to scene-related words when generating the description sentence, and during generation a double LSTM adjusts the parameters to improve the accuracy of the generated sentences. The model is trained and tested on the Flickr8K, Flickr30K, and MSCOCO image sets and verified with several evaluation metrics. The experimental results show that, compared with other models, the proposed model effectively improves the accuracy of image semantic understanding.
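The abstract gives no reference implementation, so the following is a minimal PyTorch sketch of how a dual-LSTM decoder might fuse a ResNet global feature with a Places365 scene feature and boost the logits of scene-related words during generation. All module names, feature dimensions, and the additive `scene_bias` term are illustrative assumptions, not the authors' code; the `scene_word_mask` stands in for the LDA-derived scene vocabulary.

```python
import torch
import torch.nn as nn


class SceneAwareCaptioner(nn.Module):
    """Illustrative dual-LSTM caption decoder with scene-word biasing."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(2048, hidden_dim)   # ResNet global feature
        self.scene_proj = nn.Linear(512, hidden_dim)  # Places365 scene feature
        # The first LSTM consumes the word embedding plus the global image
        # feature; the second refines its output together with the scene
        # feature before the word is predicted.
        self.lstm1 = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(hidden_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, scene_feat, captions, scene_word_mask,
                scene_bias=1.0):
        # img_feat: (B, 2048), scene_feat: (B, 512),
        # captions: (B, T) token ids, scene_word_mask: (vocab,) 0/1 vector
        B, T = captions.shape
        v = self.img_proj(img_feat)
        s = self.scene_proj(scene_feat)
        h1, c1, h2, c2 = [torch.zeros(B, self.hidden_dim,
                                      device=captions.device)
                          for _ in range(4)]
        logits = []
        for t in range(T):
            w = self.embed(captions[:, t])
            h1, c1 = self.lstm1(torch.cat([w, v], dim=1), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, s], dim=1), (h2, c2))
            step = self.out(h2)
            # Boost the logits of words tied to the detected scene so they
            # are generated with higher probability.
            logits.append(step + scene_bias * scene_word_mask)
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)


# Toy usage with random tensors; word ids 42 and 314 stand in for
# hypothetical LDA-derived scene words.
model = SceneAwareCaptioner(vocab_size=10000)
img = torch.randn(4, 2048)
scn = torch.randn(4, 512)
caps = torch.randint(0, 10000, (4, 12))
mask = torch.zeros(10000)
mask[[42, 314]] = 1.0
print(model(img, scn, caps, mask).shape)  # torch.Size([4, 12, 10000])
```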

Keywords: Image caption, Deep neural network, Scene recognition, Semantic information

Article history: Received 31 October 2018, Accepted 5 March 2019, Available online 2 April 2019, Version of Record 1 May 2019.

DOI: https://doi.org/10.1016/j.imavis.2019.03.003