- Title
The KAS corpus of Slovenian academic writing.
- Authors
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola
- Abstract
The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1,600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, meta-data, annotation, and distribution of the corpus, which is made freely available via on-line concordancers and is openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed in the scope of the project, including the manually annotated datasets used to train these tools. This specialised corpus, large by any standards, represents a substantial and highly useful language resource for the study of Slovenian academic writing and for terminology extraction.
- Subjects
ACADEMIC discourse; DIGITAL libraries; CORPORA; UNIVERSITIES & colleges; GENE ontology
- Publication
Language Resources & Evaluation, 2021, Vol 55, Issue 2, p551
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-020-09506-4