Title: Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization.
Authors: Ran, Yue; Tang, Hongying; Li, Baoqing; Wang, Guohui
Abstract: Localizing the audio-visual events in video requires a combined judgment of visual and audio components. To integrate multimodal information, existing methods modeled the cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in separate spaces, resulting in a large heterogeneity gap between modalities. Existing attention modules, on the other hand, ignore the temporal asynchrony between vision and hearing when constructing cross-modal connections, which may lead to the misinterpretation of one modality by another. Therefore, this paper aims to improve event localization performance by addressing these two problems and proposes a framework that feeds audio and visual features encoded in the same semantic space into a temporally adaptive attention module. Specifically, we develop a self-supervised representation method to encode features with a smaller heterogeneity gap by matching corresponding semantic cues between synchronized audio and visual signals. Furthermore, we develop a temporally adaptive cross-modal attention based on a weighting method that dynamically channels attention according to the time differences between event-related features. The proposed framework achieves state-of-the-art performance on the public audio-visual event dataset and the experimental results not only show that our self-supervised method can learn more discriminative features but also verify the effectiveness of our strategy for assigning attention.
Subjects: SEMANTICS; VIDEOS; HETEROGENEITY
Publication: Applied Sciences (2076-3417), 2022, Vol 12, Issue 24, p12622
ISSN: 2076-3417
Publication type: Article
DOI: 10.3390/app122412622
