- Title
LGANet: Local and global attention are both you need for action recognition.
- Authors
Wang, Hao; Zhao, Bin; Zhang, Wenjia; Liu, Guohua
- Abstract
Due to redundancy in the spatiotemporal neighborhood and the global dependency between video frames, video recognition remains a challenge. Prior works have mainly been driven by 3D convolutional neural networks (CNNs) or 2D CNNs with a well-designed module for temporal information. However, convolution-based networks lack the capability to capture global dependencies because of their limited receptive field. Alternatively, transformers for video recognition have been proposed to build long-range dependencies between frame patches. Nevertheless, most transformer-based networks incur significant computational costs because attention is calculated among all tokens. Based on these observations, we propose an efficient network, which we dub LGANet. Unlike conventional CNNs and transformers for video recognition, LGANet tackles both spatiotemporal redundancy and dependency by learning local and global token affinity in the shallow and deep layers, respectively. Specifically, local attention is implemented in the shallow layers to reduce parameters and eliminate redundancy. In the deep layers, spatial-wise and channel-wise self-attention are embedded to capture global dependencies among high-level features. Moreover, several key designs are made in the multi-head self-attention (MSA) and feed-forward network (FFN). Extensive experiments are conducted on popular video benchmarks such as Kinetics-400 and Something-Something V1 & V2. Without any bells and whistles, LGANet achieves state-of-the-art performance. The code will be released soon.
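The abstract does not specify the layer designs, so as an illustration only, here is a minimal NumPy sketch of the three attention patterns it names: local (windowed) attention over nearby tokens, spatial-wise attention over all tokens, and channel-wise attention over feature channels. Identity Q/K/V projections are assumed for brevity; the function names are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    # x: (tokens, channels). Every token attends to every other token,
    # capturing the global spatial dependency described in the abstract.
    # Q = K = V = x here (identity projections, for illustration only).
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def channel_attention(x):
    # Channel-wise attention: transpose so channels become the token axis,
    # so affinities are computed between feature channels instead of positions.
    return spatial_attention(x.T).T

def local_attention(x, window=4):
    # Windowed attention: each block of `window` tokens attends only within
    # itself, eliminating redundant long-range computation in shallow layers.
    out = np.zeros_like(x)
    for s in range(0, x.shape[0], window):
        out[s:s + window] = spatial_attention(x[s:s + window])
    return out
```

All three operators preserve the (tokens, channels) shape, so they can be stacked interchangeably across layers; the local variant only costs attention within each window rather than across all token pairs.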
- Subjects
CONVOLUTIONAL neural networks; TRANSFORMER models; VIDEO compression; RECOGNITION (Psychology)
- Publication
IET Image Processing (Wiley-Blackwell), 2023, Vol. 17, Issue 12, p. 3453
- ISSN
1751-9659
- Publication type
Article
- DOI
10.1049/ipr2.12876