- Title
A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model.
- Authors
Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing
- Abstract
The cocktail party problem can be addressed more effectively by leveraging both the speaker's visual and audio information. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information other than the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on crucial information. Third, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual modalities. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, including a 4 dB improvement in SDR.
- Subjects
SPEECH; STREAMING video & television; LIPS; COCKTAIL parties
- Publication
Sensors (14248220), 2023, Vol 23, Issue 21, p8770
- ISSN
1424-8220
- Publication type
Article
- DOI
10.3390/s23218770