- Title
Tswana finite state tokenisation.
- Authors
Pretorius, Laurette; Viljoen, Biffie; Berg, Ansu; Pretorius, Rigardt
- Abstract
Tswana, a Bantu language in the Sotho group, is characterised by an agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two finite state tokeniser transducers and a finite state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. The tokenisation approach is novel and, when implemented and evaluated, yields an F-score of 95% with respect to a hand-tokenised gold standard.
- Subjects
TSWANA language; ORTHOGRAPHY & spelling; NGUNI languages; BANTU languages; MORPHOLOGY
- Publication
Language Resources & Evaluation, 2015, Vol 49, Issue 4, p831
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-014-9292-1