- Title
TEAM: Transformer Encoder Attention Module for Video Classification.
- Authors
Hae Sung Park; Yong Suk Choi
- Abstract
Much like humans focus solely on object movement to understand actions, directing a deep learning model's attention to the core contexts within videos is crucial for improving video comprehension. In a recent study, the Video Masked Auto-Encoder (VideoMAE) employs a pre-training approach with a high ratio of tube masking and reconstruction, effectively mitigating spatial bias due to temporal redundancy in full video frames. This steers the model's focus toward detailed temporal contexts. However, as VideoMAE still relies on full video frames during the action recognition stage, it may exhibit a progressive shift in attention towards spatial contexts, deteriorating its ability to capture the main spatio-temporal contexts. To address this issue, we propose an attention-directing module named the Transformer Encoder Attention Module (TEAM). This module effectively directs the model's attention to the core characteristics within each video, inherently mitigating spatial bias. The TEAM first identifies the core features among the overall features extracted from each video. It then discerns the specific parts of the video where those features are located, encouraging the model to focus more on these informative parts. Consequently, during the action recognition stage, the proposed TEAM effectively shifts VideoMAE's attention from spatial contexts towards the core spatio-temporal contexts. This attention shift alleviates the spatial bias in the model and simultaneously enhances its ability to capture precise video contexts. We conduct extensive experiments to explore the optimal configuration that enables the TEAM to fulfill its intended design purpose and facilitates its seamless integration with the VideoMAE framework. The integrated model, i.e., VideoMAE+TEAM, outperforms the existing VideoMAE by a significant margin on Something-Something-V2 (71.3% vs. 70.3%). Moreover, qualitative comparisons demonstrate that the TEAM encourages the model to disregard insignificant features and focus more on the essential video features, capturing more detailed spatio-temporal contexts within the video.
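
The abstract describes TEAM as a two-stage process: first weighting the core features, then weighting the video regions where those features occur. The paper's actual equations are not reproduced in this record, so the following is only a minimal PyTorch sketch of one plausible reading of that description, in which a channel gate emphasizes core features and a token gate emphasizes informative spatio-temporal positions. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TEAMSketch(nn.Module):
    """Hypothetical two-stage attention sketch inspired by the abstract.

    Stage 1 gates feature channels ("core features"); stage 2 gates
    spatio-temporal tokens ("informative parts"). This is NOT the authors'
    TEAM module -- layers, shapes, and the reduction ratio are assumed.
    """
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        # Feature (channel) attention: squeeze over tokens, excite channels.
        self.feature_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        # Token attention: score each spatio-temporal token from its features.
        self.token_gate = nn.Sequential(
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) token features from a video encoder.
        channel_weights = self.feature_gate(x.mean(dim=1))  # (B, D)
        x = x * channel_weights.unsqueeze(1)   # emphasize core features
        token_weights = self.token_gate(x)     # (B, N, 1)
        return x * token_weights               # emphasize informative parts


# Usage: refine encoder tokens before the classification head.
tokens = torch.randn(2, 1568, 768)  # e.g., a ViT-B video token layout (assumed)
refined = TEAMSketch(dim=768)(tokens)
print(refined.shape)  # torch.Size([2, 1568, 768])
```

Under this reading, the module could be inserted between a VideoMAE-style encoder and its classification head, re-weighting tokens so the head attends to the gated, informative spatio-temporal features rather than the full frame uniformly.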
- Subjects
VIDEOS; DEEP learning; TRANSFORMER models; FEATURE extraction; SOCIAL context
- Publication
Computer Systems Science & Engineering, 2024, Vol 48, Issue 2, p451
- ISSN
0267-6192
- Publication type
Article
- DOI
10.32604/csse.2023.043245