Video captioning using Semantically Contextual Generative Adversarial Network

Authors:

Highlights:

Abstract

In this work, we propose a Semantically Contextual Generative Adversarial Network (SC-GAN) for video captioning. The semantic features extracted from a video are used in the discriminator to weigh the word embedding vectors. The weighted word embedding vectors, along with the visual features, are used to discriminate the ground-truth descriptions from the descriptions generated by the generator. The manager in the generator uses the features from the discriminator to generate a goal vector for the worker. The worker is trained with a goal-based reward and a semantics-based reward when generating the description. The semantics-based reward ensures that the worker generates descriptions that incorporate the semantic features, while the goal-based reward, calculated from the discriminator features, encourages descriptions similar to the ground-truth descriptions. We use the MSVD and MSR-VTT datasets to demonstrate the effectiveness of the proposed approach to video captioning.
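To make the abstract's description of the discriminator more concrete, the following is a minimal PyTorch-style sketch of how semantic features might weigh the word embeddings before they are fused with visual features for real/fake scoring. This is only an illustration of the idea as stated in the abstract, not the authors' implementation; all names (SemanticDiscriminatorSketch, sem_gate, etc.) and the specific gating/fusion choices are assumptions.

```python
import torch
import torch.nn as nn

class SemanticDiscriminatorSketch(nn.Module):
    """Illustrative sketch: semantic features gate the word embeddings,
    which are then combined with visual features to score a caption."""
    def __init__(self, vocab_size, embed_dim, sem_dim, vis_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Hypothetical gating layer: maps semantic features to per-dimension weights.
        self.sem_gate = nn.Linear(sem_dim, embed_dim)
        self.rnn = nn.GRU(embed_dim + vis_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tokens, sem_feat, vis_feat):
        # tokens: (B, T) word ids; sem_feat: (B, sem_dim); vis_feat: (B, vis_dim)
        w = torch.sigmoid(self.sem_gate(sem_feat)).unsqueeze(1)   # (B, 1, E)
        emb = self.embed(tokens) * w                              # semantically weighted embeddings
        vis = vis_feat.unsqueeze(1).expand(-1, tokens.size(1), -1)
        feats, _ = self.rnn(torch.cat([emb, vis], dim=-1))
        # feats could also serve as the discriminator features the manager uses
        # to form the worker's goal vector, per the abstract.
        return torch.sigmoid(self.score(feats[:, -1])), feats
```

Under this sketch, the worker's training signal would combine the two rewards named in the abstract, e.g. `reward = r_goal + lam * r_sem`, where `r_goal` is derived from the discriminator features, `r_sem` measures coverage of the semantic features, and `lam` is a hypothetical weighting coefficient.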

Keywords:

Review timeline: Received 20 May 2021, Revised 17 March 2022, Accepted 11 May 2022, Available online 20 May 2022, Version of Record 26 May 2022.

Paper URL: https://doi.org/10.1016/j.cviu.2022.103453