Recognizing actions in images by fusing multiple body structure cues

Highlights:

• We design two body structure cues, SBPs and LAD, to fully explore the structure information of human bodies from the local and global perspectives.

• To construct body parts at different scales in unconstrained images, we propose a technique that uses human keypoint heatmaps to generate scale-adaptive SBPs, which extract fine-grained local human features. We also propose a technique that automatically determines the most discriminative body part of each action category for identifying the ongoing action. To extract global high-level body structure features, we propose the LAD, which models the spatial angle relationships between pairs of human limbs. The LAD is more robust and performs better than the distance-based skeleton descriptor.

• We evaluate our model on two challenging image-based action datasets, and the results show that our method achieves state-of-the-art performance.
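
The angle-based idea behind the LAD can be illustrated with a small sketch. This is not the authors' implementation: the limb list, the keypoint names, and the function `pairwise_limb_angles` are hypothetical and chosen only for illustration. The sketch computes the unsigned angle between every pair of limb vectors, and it hints at one plausible reason an angle descriptor could be more robust than a distance-based one: the angles do not change when the whole skeleton is uniformly rescaled.

```python
import math
from itertools import combinations

# Hypothetical limb definitions (pairs of keypoint names); not taken from the paper.
LIMBS = [
    ("l_shoulder", "l_elbow"),
    ("l_elbow", "l_wrist"),
    ("r_shoulder", "r_elbow"),
    ("r_elbow", "r_wrist"),
]


def limb_vector(keypoints, limb):
    """Vector from the first keypoint of a limb to the second."""
    (x0, y0), (x1, y1) = keypoints[limb[0]], keypoints[limb[1]]
    return (x1 - x0, y1 - y0)


def angle_between(u, v):
    """Unsigned angle in radians between 2-D vectors u and v, in [0, pi]."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(math.atan2(cross, dot))


def pairwise_limb_angles(keypoints, limbs=LIMBS):
    """One angle per unordered pair of limbs, in a fixed order."""
    vectors = [limb_vector(keypoints, limb) for limb in limbs]
    return [angle_between(u, v) for u, v in combinations(vectors, 2)]
```

With four limbs the descriptor has six entries. Doubling every keypoint coordinate leaves all six angles unchanged, whereas any pairwise distance would double.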

Abstract

• We propose a unified model for recognizing human actions in static images. It explicitly investigates body structure information and integrates body structure exploration and action classification into a single model. Moreover, we design a two-step learning technique in which keypoint estimation provides intermediate supervision for learning human action representations.
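
To make the notion of scale-adaptive body parts derived from keypoint heatmaps more concrete, here is a minimal sketch under stated assumptions. The source does not say how SBPs are built from the heatmaps, so the rule used here (take the bounding box of all cells whose response is at least a fraction of the peak) is an assumption, and `part_box_from_heatmap` and `rel_threshold` are hypothetical names. The point is only that a wider heatmap response yields a larger part box without any hand-set box size.

```python
def part_box_from_heatmap(heatmap, rel_threshold=0.5):
    """Bounding box (x_min, y_min, x_max, y_max) of cells >= rel_threshold * peak.

    `heatmap` is a list of rows of non-negative floats for one keypoint.
    Returns None when the heatmap has no positive response.
    The thresholding rule is an illustrative assumption, not the paper's method.
    """
    peak = max(max(row) for row in heatmap)
    if peak <= 0:
        return None
    cutoff = rel_threshold * peak
    xs, ys = [], []
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value >= cutoff:
                xs.append(x)
                ys.append(y)
    return (min(xs), min(ys), max(xs), max(ys))
```

A sharply peaked heatmap (a small or distant person) produces a tight box, while a broad response produces a proportionally larger one, which is the behaviour the term "scale-adaptive" suggests.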