- Title
An approach to enhance topic modeling by using paratext and nonnegative matrix factorizations.
- Authors
Flores-Garrido, Marisol; García-Velázquez, Luis Miguel; López-Vázquez, Julieta Arisbe
- Abstract
Given the growing development and use of computational methods in humanities research, it is necessary to propose methodologies that properly explore the questions posed by different disciplines, taking into account the locality of both the data and the process that generated it. In the present work, we explore the problem of automatically identifying the main topics in collections of Nahua discourses known as huehuetlahtollis. Each document in the collections is introduced by an extended title, and a natural question is whether enhancing the role of title terms during the unsupervised learning process could enrich the results. Aiming at explainability, we consider a model based on nonnegative matrix factorization (NMF). An overview of the historical process behind the composition of the explored corpora suggests that titles reflect the point of view of the collection's compiler in ways that justify viewing the paratext as a supplementary source on the material. We therefore propose a bi-objective NMF scheme that reflects this a priori knowledge of the corpus, linking and combining the information in titles and content to improve accuracy in identifying topic groups and relevant terms within a corpus. By comparing three different schemes against the labels assigned by an expert, we show that our model better reflects the nature of the data, which translates into higher accuracy. Finally, we present some insights into the studied corpora derived from our analysis of the identified relevant terms.
- Subjects
MATRIX decomposition; NONNEGATIVE matrices; PARATEXT; CORPORA; ELECTRONIC data processing
- Publication
Digital Scholarship in the Humanities, 2023, Vol 38, Issue 1, p87
- ISSN
2055-768X
- Publication type
Article
- DOI
10.1093/llc/fqac043
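
The abstract describes linking title and content information in a single NMF model but does not give the formulation. As a rough illustration only, the sketch below couples two term matrices through a shared document-topic matrix W and minimizes ||X_content - W H_c||^2 + alpha * ||X_title - W H_t||^2 with standard multiplicative updates. This objective, the function name `joint_nmf`, and the weight `alpha` are assumptions made for this example; they are not taken from the article and may differ from the authors' bi-objective scheme.

```python
import numpy as np


def joint_nmf(X_content, X_title, k, alpha=1.0, n_iter=200, seed=0, eps=1e-9):
    """Hypothetical coupled NMF (an assumption, not the article's method).

    Minimizes ||X_content - W Hc||_F^2 + alpha * ||X_title - W Ht||_F^2
    subject to W, Hc, Ht >= 0, where W (documents x topics) is shared.
    """
    rng = np.random.default_rng(seed)
    n_docs = X_content.shape[0]
    W = rng.random((n_docs, k))
    Hc = rng.random((k, X_content.shape[1]))  # topic-term weights, content vocabulary
    Ht = rng.random((k, X_title.shape[1]))    # topic-term weights, title vocabulary

    for _ in range(n_iter):
        # Lee-Seung style multiplicative updates; eps avoids division by zero.
        WtW = W.T @ W
        Hc *= (W.T @ X_content) / (WtW @ Hc + eps)
        Ht *= (W.T @ X_title) / (WtW @ Ht + eps)
        numerator = X_content @ Hc.T + alpha * (X_title @ Ht.T)
        denominator = W @ (Hc @ Hc.T + alpha * (Ht @ Ht.T)) + eps
        W *= numerator / denominator
    return W, Hc, Ht


if __name__ == "__main__":
    # Toy nonnegative matrices standing in for document-term counts.
    rng = np.random.default_rng(42)
    X_content = rng.random((10, 30))  # 10 documents, 30 content terms
    X_title = rng.random((10, 8))     # same 10 documents, 8 title terms
    W, Hc, Ht = joint_nmf(X_content, X_title, k=3, alpha=2.0)
    print("topic per document:", W.argmax(axis=1))
    print("top content terms per topic:", np.argsort(-Hc, axis=1)[:, :5])
```

In this sketch, a larger `alpha` gives the title terms more influence on the shared topic assignment, and the largest entries in each row of `Hc` and `Ht` can be read as the most relevant terms for that topic, which is the kind of interpretability NMF is usually chosen for.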