- Title
Multi-Level Disentangled Personalized Speech Synthesis for Out-of-Domain Speaker Adaptation Scenarios
- Authors
高盛祥; 杨元樟; 王琳钦; 莫尚斌; 余正涛; 董凌
- Abstract
Personalized speech synthesis aims to generate speech with a specific speaker's characteristics. Traditional approaches often show noticeable timbre differences when synthesizing speech for unseen speakers, and they struggle to disentangle speaker-specific timbre features. This paper proposes a multi-level disentangled personalized speech synthesis approach designed for out-of-domain speakers. By fusing features at different granularities, the method improves synthesis for unseen speakers under zero-resource conditions. At the sentence level, fast Fourier convolution extracts global speaker features, which improves the model's generalization to unseen speakers and enables sentence-level speaker disentanglement. At the phoneme level, a speech recognition model is used to decouple speaker features, and an attention mechanism captures phoneme-level timbre features, achieving phoneme-level speaker disentanglement. Experiments on the publicly available AISHELL3 dataset show that the proposed approach reaches a cosine similarity of 0.697 between speaker feature vectors in cross-speaker adaptation, a 6.25% improvement over the baseline model. This result indicates that the method can model timbre features for unseen speakers in cross-speaker adaptation scenarios.
- Subjects
SPEECH synthesis; PHONEME (Linguistics); GENERALIZATION; SPEECH perception
- Publication
Journal of Guangxi Normal University - Natural Science Edition, 2024, Vol 42, Issue 4, p11
- ISSN
1001-6600
- Publication type
Article
- DOI
10.16088/j.issn.1001-6600.2023111303
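
The abstract reports its result as a cosine similarity between speaker feature vectors. As a minimal sketch of that metric only, the Python below computes the cosine similarity of two vectors. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; the record does not describe how the paper extracts its speaker embeddings, and this is not the authors' evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Hypothetical helper: in a speaker-similarity evaluation, `a` and `b`
    would be speaker feature vectors taken from reference and synthesized
    speech. Here they are plain lists of floats.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors, not real speaker embeddings.
    reference = [0.2, 0.5, 0.1, 0.7]
    synthesized = [0.1, 0.6, 0.0, 0.8]
    print(round(cosine_similarity(reference, synthesized), 3))
```

Values closer to 1 mean the two vectors point in more similar directions, which is why a higher score is read as a closer timbre match.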