Affective interaction recognition using spatio-temporal features and context

Abstract

This paper focuses on recognizing human interactions that relate to human emotion, and addresses the problem of representing interaction features. We propose a two-layer feature description structure that represents spatio-temporal motion features and context features hierarchically. On the lower layer, local features for motion and for interactive context are extracted separately. We first characterize local spatio-temporal trajectories as the motion features. Instead of relying on hand-crafted features, we present a new hierarchical spatio-temporal trajectory coding model that learns and represents these local trajectories. To further exploit the spatial and temporal relationships in interactive activities, we then propose an interactive context descriptor, which extracts local interactive contours from frames. These contours implicitly incorporate contextual spatial and temporal information. On the higher layer, semi-global features are built from the local features encoded on the lower layer. A spatio-temporal segment clustering method is designed for feature extraction on this layer. This method takes the spatial relationships and temporal order of the local features into account and produces mid-level motion features and mid-level context features. Experiments are conducted on three challenging video action datasets: HMDB51, Hollywood2, and UT-Interaction. The results demonstrate the efficacy of the proposed structure and validate the effectiveness of the proposed context descriptor.