- Title
A Multi-Scale Video Longformer Network for Action Recognition.
- Authors
Chen, Congping; Zhang, Chunsheng; Dong, Xin
- Abstract
Action recognition has found extensive applications in fields such as video classification and security monitoring. However, existing action recognition methods, such as those based on 3D convolutional neural networks, often struggle to capture comprehensive global information. Meanwhile, transformer-based approaches face challenges associated with excessively high computational complexity. We introduce a Multi-Scale Video Longformer network (MSVL), built upon the 3D Longformer architecture featuring a "local attention + global features" attention mechanism, enabling us to reduce computational complexity while preserving global modeling capabilities. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimensions across four stages. In the lower layers of the network (stage 1, stage 2), we leverage local window attention to alleviate local redundancy and computational demands. Concurrently, global tokens are employed to retain global features. In the higher layers of the network (stage 3, stage 4), this local window attention evolves into a dense computation mechanism, enhancing overall performance. Finally, extensive experiments are conducted on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%), demonstrating the effectiveness and efficiency of the MSVL.
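The "local attention + global features" idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it flattens the video tokens to a 1D sequence, uses a single head without learned projections, and the names `build_mask` and `masked_attention`, the window size, and the token counts are all assumptions made for illustration. Each local token attends only to neighbours inside a window plus every global token, while global tokens attend to everything.

```python
import numpy as np

def build_mask(n_local, n_global, window):
    """Boolean mask of shape [T, T], T = n_global + n_local; True = may attend."""
    total = n_global + n_local
    mask = np.zeros((total, total), dtype=bool)
    # Global tokens attend to every token, and every token attends to them.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # Local tokens attend to neighbours at most `window` positions away.
    idx = np.arange(n_local)
    mask[n_global:, n_global:] = np.abs(idx[:, None] - idx[None, :]) <= window
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the positions allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes: 2 global tokens, 8 local tokens, feature dimension 4.
rng = np.random.default_rng(0)
n_global, n_local, dim, window = 2, 8, 4, 1
x = rng.standard_normal((n_global + n_local, dim))

mask = build_mask(n_local, n_global, window)
out, weights = masked_attention(x, x, x, mask)

# A window that spans the whole sequence makes every pair visible (dense attention).
dense_mask = build_mask(n_local, n_global, window=n_local - 1)
```

With a small window the number of attended pairs grows roughly linearly with sequence length rather than quadratically. Once the window covers the whole, already downsampled, sequence, the mask becomes fully dense, which is one way to read the abstract's statement that local window attention turns into dense computation in stages 3 and 4.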
- Subjects
CONVOLUTIONAL neural networks; VIDEO monitors; VIDEO surveillance; SECURITY classification (Government documents); RECOGNITION (Psychology); COMPUTATIONAL complexity; ELECTRIC transformers
- Publication
Applied Sciences (2076-3417), 2024, Vol 14, Issue 3, p1061
- ISSN
2076-3417
- Publication type
Article
- DOI
10.3390/app14031061