Weakly supervised learning of actions from transcripts

作者:

Highlights:

摘要

We present an approach for weakly supervised learning of human actions from video transcriptions. Our system is based on the idea that, given a sequence of input data and a transcript, i.e. a list of the order the actions occur in the video, it is possible to infer the actions within the video stream and to learn the related action models without the need for any frame-based annotation. Starting from the transcript information at hand, we split the given data sequences uniformly based on the number of expected actions. We then learn action models for each class by maximizing the probability that the training video sequences are generated by the action models given the sequence order as defined by the transcripts. The learned model can be used to temporally segment an unseen video with or without transcript. Additionally, the inferred segments can be used as a starting point to train high-level fully supervised models.We evaluate our approach on four distinct activity datasets, namely Hollywood Extended, MPII Cooking, Breakfast and CRIM13. It shows that the proposed system is able to align the scripted actions with the video data, that the learned models localize and classify actions in the datasets, and that they outperform any current state-of-the-art approach for aligning transcripts with video data.

论文关键词:

论文评审过程:Received 15 September 2016, Revised 29 May 2017, Accepted 10 June 2017, Available online 20 June 2017, Version of Record 23 November 2017.

论文官网地址:https://doi.org/10.1016/j.cviu.2017.06.004