Qiao, Minglang; Liu, Yufan; Xu, Mai; Deng, Xin; Li, Bing; Hu, Weiming; Borji, Ali

doi:10.1007/s11263-023-01950-3

Title: Joint Learning of Audio–Visual Saliency Prediction and Sound Source Localization on Multi-face Videos.
Authors: Qiao, Minglang; Liu, Yufan; Xu, Mai; Deng, Xin; Li, Bing; Hu, Weiming; Borji, Ali
Abstract: Visual and audio events occur simultaneously, and both attract attention. However, most existing saliency prediction works ignore the influence of audio and consider only the visual modality. In this paper, we propose a multi-task learning method for audio–visual saliency prediction and sound source localization on multi-face video that leverages visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face videos under the audio–visual condition, containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention and, conversely, that attention offers a cue for determining the sound source in multi-face video. Guided by these findings, we introduce an audio–visual multi-task network (AVM-Net) to predict saliency and locate the sound source. AVM-Net consists of three branches corresponding to the visual, audio and face modalities. The visual branch has a two-stream architecture to capture spatial and temporal information. The audio and face branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph is constructed to model the interaction among multiple faces. Jointly optimizing these branches exploits the intrinsic correlation between saliency prediction and sound source localization, so that each task improves the performance of the other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods and achieves competitive results in sound source localization.
Subjects: ACOUSTIC localization; TRANSMISSION of sound; DATABASES; CEREBRAL arteriovenous malformations; VIDEOS; FORECASTING; EYE tracking
Publication: International Journal of Computer Vision, 2024, Vol 132, Issue 6, p2003
ISSN: 0920-5691
Publication type: Article
DOI: 10.1007/s11263-023-01950-3
