- Title
VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition.
- Authors
Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju
- Abstract
Recent-generation Microsoft Kinect cameras capture multimodal signals that provide RGB video, depth sequences, and skeleton information, making it possible to improve human action recognition by fusing different data modalities. However, most existing fusion methods simply combine the features and ignore the underlying semantics between modalities, which reduces accuracy. In addition, the data contain a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves recognition accuracy in two ways: 1) An effective two-stream feature pooling and fusion mechanism is proposed, in which RGB frames and skeleton features are fused to enhance the spatio-temporal feature representation. 2) A lightweight spatial multiscale vision Transformer is proposed, which reduces computational cost. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable to state-of-the-art methods.
- Subjects
HUMAN activity recognition; TRANSFORMER models; KINECT (Motion sensor); HUMAN skeleton; SKELETON
- Publication
Multimedia Tools & Applications, 2024, Vol 83, Issue 29, p73391
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-023-17788-3