- Title
Joint Learning of Audio–Visual Saliency Prediction and Sound Source Localization on Multi-face Videos.
- Authors
Qiao, Minglang; Liu, Yufan; Xu, Mai; Deng, Xin; Li, Bing; Hu, Weiming; Borji, Ali
- Abstract
Visual and audio events occur simultaneously, and both attract attention. However, most existing saliency prediction works ignore the influence of audio and consider only the visual modality. In this paper, we propose a multi-task learning method for audio–visual saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face videos under audio–visual conditions, containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention, and conversely that attention offers a cue for determining the sound source in multi-face video. Guided by these findings, an audio–visual multi-task network (AVM-Net) is introduced to predict saliency and localize the sound source. AVM-Net consists of three branches corresponding to the visual, audio and face modalities. The visual branch has a two-stream architecture to capture spatial and temporal information. The audio and face branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph is constructed to model the interaction among multiple faces. With joint optimization of these branches, the intrinsic correlation between the tasks of saliency prediction and sound source localization is exploited, and each task boosts the performance of the other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods and achieves competitive results in sound source localization.
- Subjects
ACOUSTIC localization; TRANSMISSION of sound; DATABASES; VIDEOS; FORECASTING; EYE tracking
- Publication
International Journal of Computer Vision, 2024, Vol 132, Issue 6, p2003
- ISSN
0920-5691
- Publication type
Article
- DOI
10.1007/s11263-023-01950-3