Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju

doi:10.1007/s11042-023-17788-3

Title: VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition.
Authors: Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju
Abstract: The recent generation of the Microsoft Kinect camera captures multimodal signals comprising RGB video, depth sequences, and skeleton information, making it possible to improve human action recognition by fusing different data modalities. However, most existing fusion methods simply combine the features, ignoring the underlying semantics between the different modalities, which reduces accuracy. In addition, there is a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves recognition accuracy in the following ways: 1) An effective two-stream feature pooling and fusion mechanism is proposed, in which the RGB frames and skeleton are fused to enhance the spatio-temporal feature representation. 2) A spatial lightweight multiscale vision Transformer is proposed, which reduces the computational cost. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable to state-of-the-art methods.
Subjects: HUMAN activity recognition; TRANSFORMER models; KINECT (Motion sensor); HUMAN skeleton; SKELETON
Publication: Multimedia Tools & Applications, 2024, Vol 83, Issue 29, p73391
ISSN: 1380-7501
Publication type: Article
DOI: 10.1007/s11042-023-17788-3
