- Title
Arabic texts analysis for topic modeling evaluation.
- Authors
Brahmi, Abderrezak; Ech-Cherif, Ahmed; Benyettou, Abdelkader
- Abstract
Significant progress has been made in information retrieval, covering semantic text indexing and multilingual analysis. However, developments in Arabic information retrieval have not kept pace with the extraordinary growth of Arabic usage on the Web during the last ten years. In tasks relating to semantic analysis, it is preferable to deal directly with texts in their original language. Studies on topic models, which provide a good way to automatically handle the semantics embedded in texts, are not complete enough to assess the effectiveness of the approach on Arabic texts. This paper investigates several text stemming methods for Arabic topic modeling. A new lemma-based stemmer is described and applied to newspaper articles. The Latent Dirichlet Allocation model is used to extract latent topics from three real-world Arabic corpora. For supervised classification in the topic space, experiments show an improvement compared with classification in the full word space or with a root-based stemming approach. In addition, topic modeling with lemma-based stemming allows us to discover interesting subjects in press articles published during the 2007-2009 period.
- Subjects
INFORMATION retrieval research; ARABIC language; WEB analytics; CORPORA; UNBIS (Information retrieval system); DIACRITICS
- Publication
Information Retrieval Journal, 2012, Vol 15, Issue 1, p33
- ISSN
1386-4564
- Publication type
Article
- DOI
10.1007/s10791-011-9171-y