Title: A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model.
Authors: Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing
Abstract: The cocktail party problem can be addressed more effectively by leveraging both the speaker's visual and audio information. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information other than the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Third, the loss function incorporates audio-visual similarity to take full advantage of the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR.
Subjects: SPEECH; STREAMING video & television; LIPS; COCKTAIL parties
Publication: Sensors (14248220), 2023, Vol 23, Issue 21, p8770
ISSN: 1424-8220
Publication type: Article
DOI: 10.3390/s23218770
