Title: From 0 to 10 million annotated words: part-of-speech tagging for Middle High German.
Authors: Schulz, Sarah; Ketschik, Nora
Abstract: By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we exemplify how existing resources can be adapted fruitfully to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. In order to verify its general applicability, we evaluate the performance separately on different genres, authors and varieties of MHG. We explore self-learning techniques, which have the advantage that unannotated data can be utilized to improve tagging performance on specific subcorpora.
Subjects: NATURAL language processing; DATA quality; TRAINING needs
Publication: Language Resources & Evaluation, 2019, Vol 53, Issue 4, p837
ISSN: 1574-020X
Publication type: Article
DOI: 10.1007/s10579-019-09462-8
