- Title
Mandarin-English code-switching speech corpus in South-East Asia: SEAME.
- Authors
Lyu, Dau-Cheng; Tan, Tien-Ping; Chng, Eng-Siong; Li, Haizhou
- Abstract
This paper introduces the South East Asia Mandarin-English corpus, a 63-h spontaneous Mandarin-English code-switching transcribed speech corpus suitable for LVCSR and language change detection/identification research. The corpus is recorded under unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases of such segments.
- Subjects
CHINA; MANDARIN dialects -- Study & teaching; CHINESE dialects; PHONEME (Linguistics); LEXICAL access; WORD recognition; CHINESE language
- Publication
Language Resources & Evaluation, 2015, Vol 49, Issue 3, p581
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-015-9303-x