Applied Sciences, Vol. 15, Pages 4511: GCN-Former: A Method for Action Recognition Using Graph Convolutional Networks and Transformer
Applied Sciences doi: 10.3390/app15084511
Authors:
Xueshen Cui
Jikai Zhang
Yihao He
Zhixing Wang
Wentao Zhao
Skeleton-based action recognition, which classifies human actions from the coordinates of body joints and their connectivity, is a significant research area in computer vision with broad application potential. Although Graph Convolutional Networks (GCNs) have made substantial progress in processing skeleton data represented as graphs, their performance is constrained by local receptive fields and fixed joint connection patterns. Recently, researchers have introduced Transformer-based methods to overcome these limitations and better capture long-range dependencies. However, these methods incur a high computational cost when they attempt to capture the correlations between all joints across all frames. This paper proposes GCN-Former, a spatio-temporal graph convolutional network that aims to improve performance on skeleton-based action recognition tasks. The model integrates the Transformer architecture with traditional GCNs, combining the Transformer’s capability for handling long-sequence data with the GCN’s effective capture of spatial dependencies. Specifically, we design a Transformer Block temporal encoder, based on the self-attention mechanism, to model long temporal action sequences. The temporal encoder captures long-range dependencies in action sequences while retaining global contextual information in the temporal dimension. In addition, to achieve a smooth transition from GCNs to Transformers, we develop a contextual temporal attention (CTA) module. Together, these components are intended to enhance the understanding of temporal and spatial information within action sequences.
Experimental validation on multiple benchmark datasets demonstrates that our approach surpasses existing techniques in prediction accuracy. It also shows significant advantages on action recognition tasks involving long time sequences, where it more effectively captures long-range dependencies in complex action patterns.
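The pairing of a spatial graph convolution with a temporal self-attention encoder described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy five-joint skeleton, the layer sizes, the random weights, and every function name below are assumptions made for illustration. The attention is single-head with no positional encoding, and the CTA module is not modelled because the abstract does not specify it. The last lines count attention pairs to show why attending over all joints in all frames is more expensive than attending over frames for each joint separately.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spatial_graph_conv(x, a_hat, w):
    """Mix features along skeleton edges, then apply ReLU.
    x: (T, V, C_in), a_hat: (V, V), w: (C_in, C_out) -> (T, V, C_out)."""
    return np.maximum(np.einsum("uv,tvc,cd->tud", a_hat, x, w), 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(h, wq, wk, wv):
    """Single-head self-attention over frames, applied to each joint independently.
    h: (T, V, C) -> output (T, V, C) and attention weights (V, T, T)."""
    hv = h.transpose(1, 0, 2)                 # (V, T, C): one frame sequence per joint
    q, k, v = hv @ wq, hv @ wk, hv @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)           # each frame attends to every frame
    return (attn @ v).transpose(1, 0, 2), attn

# Hypothetical toy input: 16 frames, 5 joints in a chain, 3-D coordinates.
T, V, C_IN, C = 16, 5, 3, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, V, C_IN))
a_hat = normalized_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)], V)

h = spatial_graph_conv(x, a_hat, rng.normal(size=(C_IN, C)))
wq, wk, wv = (rng.normal(size=(C, C)) for _ in range(3))
out, attn = temporal_self_attention(h, wq, wk, wv)

# Attention pairs: all joints across all frames vs. frames only, per joint.
full_pairs = (T * V) ** 2
temporal_pairs = V * T * T
print(out.shape, attn.shape, full_pairs, temporal_pairs)
```

In this sketch the graph convolution only mixes information between neighbouring joints within a frame, while the attention step lets every frame of a joint's sequence attend to every other frame, which is one simple way to obtain a global temporal receptive field.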
Xueshen Cui www.mdpi.com