- Title
The IMP historical Slovene language resources.
- Authors
Erjavec, Tomaž
- Abstract
The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, in cases of archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (of these 4,000 archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web and the corpora via a concordancer. For language technology research and development the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from , the process of their compilation, encoding and dissemination, and concludes with directions for future research.
- Subjects
FOREIGN language education; LEXICON; DIGITAL libraries; INFORMATION storage & retrieval systems; SYNTAX (Grammar)
- Publication
Language Resources & Evaluation, 2015, Vol 49, Issue 3, p753
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-015-9294-7