- Title
Automatic Encoding and Language Detection in the GSDL.
- Authors
Pinkas, Otakar
- Abstract
Automatic detection of the encoding and language of a text is part of the Greenstone Digital Library Software (GSDL), which is used for building and distributing digital collections. GSDL is developed by the University of Waikato (New Zealand) in cooperation with UNESCO. Automatic encoding and language detection for Slavic languages is difficult and sometimes fails. The aim is to detect cases of failure. Automatic detection in the GSDL is based on the n-gram method. The most frequent n-grams for Czech are presented, and the whole process of automatic detection in the GSDL is described. The input documents for the test collections are plain texts encoded in ISO-8859-1, ISO-8859-2 and Windows-1250. We manually evaluated the quality of automatic detection. The causes of errors include the predominance of an improper language model and an incorrect switch to Windows-1250. We carried out further tests on more complex documents.
- Subjects
DIGITAL library software; N-gram models (Computational linguistics); SLAVIC languages; LANGUAGE & languages; UNIVERSITY of Waikato; UNESCO; COMPUTER software
- Publication
Journal of Systems Integration (1804-2724), 2014, Vol 5, Issue 4, p47
- ISSN
1804-2724
- Publication type
Article
- DOI
10.20470/jsi.v5i4.211
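
The abstract names an n-gram method without giving its details. As a rough illustration only, the sketch below shows one well-known n-gram technique: rank-ordered character n-gram profiles compared with an "out-of-place" distance. The record does not say that the GSDL works exactly this way. The function names (`ngram_profile`, `out_of_place`, `detect`), the `models` dictionary, the profile size and the order in which encodings are tried are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (lengths 1..n_max), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(raw, models, encodings=("iso-8859-2", "windows-1250", "iso-8859-1")):
    """Try each candidate encoding and language model; return (score, lang, encoding).

    `models` maps a language code to a reference profile (hypothetical input).
    Returns None if no candidate encoding can decode the bytes.
    """
    best = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding
        profile = ngram_profile(text)
        for lang, model in models.items():
            score = out_of_place(profile, model)
            if best is None or score < best[0]:
                best = (score, lang, enc)
    return best
```

A caller would build `models` from reference texts in each language, for example `{"cs": ngram_profile(czech_text), "en": ngram_profile(english_text)}`, and pass the raw bytes of a document to `detect`. ISO-8859-2 and Windows-1250 share most byte values for Czech letters, so a short document can score almost equally well under both, which is consistent with the kind of incorrect switch to Windows-1250 that the abstract reports.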