Authors
Bing Su, Jiahuan Zhou, Xiaoqing Ding, Ying Wu
Description
Generally, the evolution of an action is not uniform across a video; it exhibits complex rhythms and non-stationary dynamics. To model such non-uniform temporal dynamics, this paper describes a novel hierarchical dynamic parsing and encoding method that captures both the locally smooth dynamics and the globally drastic dynamic changes. It parses the dynamics of an action into different layers and encodes this multi-layer temporal information into a joint representation for action recognition. At the first layer, the action sequence is parsed in an unsupervised manner by temporal clustering into several smoothly changing stages, corresponding to different key poses or temporal structures. The dynamics within each stage are encoded by mean-pooling or rank-pooling. At the second layer, the temporal information of the ordered dynamics extracted from the previous layer is encoded again by rank-pooling to form the overall representation. Extensive experiments on a gesture dataset (Chalearn Gesture) and two generic action datasets (Olympic Sports and Hollywood2) demonstrate the effectiveness of the proposed method.
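The two-layer pipeline described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the unsupervised temporal clustering is replaced by a simple stand-in (even contiguous segmentation), and rank-pooling is approximated by a least-squares regression of frame order onto features; the actual method uses temporal clustering and a learning-to-rank formulation.

```python
import numpy as np

def temporal_parse(frames, n_stages):
    # Layer-1 parsing stand-in: split the sequence into contiguous stages.
    # (The paper uses unsupervised temporal clustering instead.)
    return np.array_split(frames, n_stages)

def rank_pool(features):
    # Approximate rank-pooling: find weights w such that w . f_t preserves
    # the temporal order, here via least squares on the frame indices.
    t = np.arange(1, len(features) + 1, dtype=float)
    w, *_ = np.linalg.lstsq(features, t, rcond=None)
    return w

def hierarchical_encode(frames, n_stages=4):
    # Layer 1: parse into stages, then mean-pool the dynamics within each stage.
    stages = temporal_parse(frames, n_stages)
    stage_descriptors = np.stack([s.mean(axis=0) for s in stages])
    # Layer 2: rank-pool the ordered stage descriptors into the final representation.
    return rank_pool(stage_descriptors)

# Example: 40 frames of 8-dimensional features -> one 8-dimensional representation.
frames = np.random.default_rng(0).normal(size=(40, 8))
representation = hierarchical_encode(frames, n_stages=4)
```

The final vector has the same dimensionality as the frame features and can be fed to any standard classifier for action recognition.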