- Title
From 0 to 10 million annotated words: part-of-speech tagging for Middle High German.
- Authors
Schulz, Sarah; Ketschik, Nora
- Abstract
By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the quantity of data needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we exemplify how existing resources can be adapted to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. To verify its general applicability, we evaluate its performance on different genres, authors and varieties of MHG separately. We also explore self-learning techniques, which have the advantage that unannotated data can be used to improve tagging performance on specific subcorpora.
- Subjects
Natural language processing; Data quality; Training needs
- Publication
Language Resources & Evaluation, 2019, Vol 53, Issue 4, p837
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-019-09462-8