Chen, Congping; Zhang, Chunsheng; Dong, Xin

doi:10.3390/app14031061

Title: A Multi-Scale Video Longformer Network for Action Recognition.
Authors: Chen, Congping; Zhang, Chunsheng; Dong, Xin
Abstract: Action recognition has found extensive applications in fields such as video classification and security monitoring. However, existing action recognition methods, such as those based on 3D convolutional neural networks, often struggle to capture comprehensive global information. Meanwhile, transformer-based approaches face challenges associated with excessively high computational complexity. We introduce a Multi-Scale Video Longformer network (MSVL), built upon the 3D Longformer architecture featuring a "local attention + global features" attention mechanism, enabling us to reduce computational complexity while preserving global modeling capabilities. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimensions across four stages. In the lower layers of the network (stage 1, stage 2), we leverage local window attention to alleviate local redundancy and computational demands. Concurrently, global tokens are employed to retain global features. In the higher layers of the network (stage 3, stage 4), this local window attention evolves into a dense computation mechanism, enhancing overall performance. Finally, extensive experiments are conducted on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%), demonstrating the effectiveness and efficiency of the MSVL.
Subjects: CONVOLUTIONAL neural networks; VIDEO monitors; VIDEO surveillance; SECURITY classification (Government documents); RECOGNITION (Psychology); COMPUTATIONAL complexity; ELECTRIC transformers
Publication: Applied Sciences (2076-3417), 2024, Vol 14, Issue 3, p1061
ISSN: 2076-3417
Publication type: Article
DOI: 10.3390/app14031061
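
The "local attention + global features" mechanism described in the abstract can be illustrated with a small attention mask. The sketch below is a simplified one-dimensional illustration, not the authors' implementation: MSVL operates on 3D video tokens, and the function names, the window size, and the choice to place global tokens at the front of the sequence are all assumptions made for this example. It only shows why restricting each token to a local window plus a few global tokens involves fewer query-key pairs than dense attention.

```python
def longformer_mask(num_tokens, window, num_global):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Hypothetical 1D simplification: the first `num_global` positions act as
    global tokens (they attend everywhere and are attended to by every token);
    all other tokens attend only within a local window of `window` positions.
    """
    half = window // 2
    mask = []
    for i in range(num_tokens):
        row = []
        for j in range(num_tokens):
            is_global = i < num_global or j < num_global
            is_local = abs(i - j) <= half
            row.append(is_global or is_local)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the query-key pairs that the mask allows."""
    return sum(sum(row) for row in mask)


n = 16
# Local window of 3 tokens plus one global token (as in the lower stages).
sparse = longformer_mask(n, window=3, num_global=1)
# A window wider than the sequence degenerates to dense attention
# (as in the higher stages, where the token count is already small).
dense = longformer_mask(n, window=2 * n, num_global=0)

print(attended_pairs(sparse), "of", attended_pairs(dense), "pairs")
```

For a fixed window size, the number of allowed pairs in the sparse mask grows roughly linearly with the token count, while the dense mask grows quadratically. This is consistent with the abstract's description of using windowed attention in the high-resolution early stages and dense attention only after the resolution has been reduced.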
