- Title
A 700M+ Arabic corpus: KACST Arabic corpus design and construction.
- Authors
Al-Thubaity, Abdulmohsen
- Abstract
Compared with English, Arabic is a poorly-resourced language within the field of corpus linguistics. A lack of sufficient data and research has negatively affected Arabic corpus-based researchers and natural language processing practitioners. Although a number of Arabic corpora have been developed in recent years, the overall situation has improved little. The aim of this paper is twofold. First, it reviews 14 Arabic corpora categorized by their designated purpose, target language, mode of text, size, text date, location, text type/medium, text domain, representativeness, and balance. The review also describes the availability of the reviewed corpora, the presence of tokenization, lemmatization and tagging, and whether there are any tools available to search and explore them. Second, it introduces the King Abdulaziz City for Science and Technology (KACST) Arabic corpus, which was designed and created to overcome the limitations of existing Arabic corpora. The KACST Arabic corpus is a large and diverse Arabic corpus with clearly defined design criteria. It is carefully sampled, and its contents are classified based on time, region, medium, domain, and topic, and it can be searched and explored using these classifications. The KACST Arabic corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums. Each text has been further classified more specifically into domains and topics. The KACST Arabic corpus is freely available to explore on the Internet using a variety of tools.
- Subjects
ARABIC language education; LINGUISTIC analysis; FOREIGN language education; ARTISTIC creation; TAGMEMICS
- Publication
Language Resources & Evaluation, 2015, Vol 49, Issue 3, p721
- ISSN
1574-020X
- Publication type
Article
- DOI
10.1007/s10579-014-9284-1