- Title
Ang Paggamit ng Trigram Ranking Bilang Panukat sa Pagkakahalintulad at Pagkakapangkat ng mga Wika.
- Authors
Oco, Nathaniel; Sison-Buban, Raquel; Syliongka, Leif Romeritch; Roxas, Rachel Edita; Ilao, Joel
- Abstract
A trigram is a three-letter sequence within a word. For example, the trigrams that can be generated from the word "tatlo" are tat, atl, and tlo. This research presents trigram ranking, a metric for language similarity. It involves [1] collecting large amounts of text as training data, [2] generating trigram profiles from the training data, and [3] computing language similarity using trigrams. Also presented is the use of k-means clustering to group languages based on their trigram rankings. In this study, the Internet was mined for texts using automatic means: [1] an XML-to-text converter was used to gather English and Filipino Wikipedia articles; [2] a web crawler was used to collect online news articles; [3] a Twitter API was used to collect tweets; and [4] a bot was used to collect chat logs from Ragnarok, an online game. Documents from a parallel corpus and documents from an online corpus were also collected. The following languages were used as the test bed: Bikol, Cebuano, Hiligaynon, Iloko, Pampanga, Pangasinan, Tagalog, and Waray. Based on the results, language pairs with trigram rankings close to each other come from the same subfamily of languages: [1] Bikol, Cebuano, Hiligaynon, Tagalog, and Waray come from one subgroup; [2] Iloko and Pangasinan come from one subgroup; and [3] Pampanga comes from another subgroup. Trigram ranking can be used to measure which Philippine languages are closely related.
- Subjects
PHILIPPINE languages; LANGUAGE & languages; COMMUNICATION; ETHNOLOGY; PHILOLOGY
- Publication
Malay, 2014, Vol 26, Issue 2, p53
- ISSN
0115-6195
- Publication type
Article
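
The trigram steps described in the abstract can be illustrated with a minimal Python sketch. It reproduces the "tatlo" example and builds a ranked trigram profile from a text. The abstract does not state the exact similarity formula, so the rank comparison below uses an out-of-place rank distance as an assumption, not the authors' method. The function names `word_trigrams`, `trigram_profile`, and `out_of_place_distance`, and the `top_n` cutoff, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def word_trigrams(word):
    """Return every 3-letter window of a word, left to right."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_profile(text, top_n=300):
    """Map each of the most frequent trigrams to its rank (0 = most frequent).

    top_n is an assumed cutoff, not a value taken from the abstract.
    """
    counts = Counter()
    for word in text.lower().split():
        counts.update(word_trigrams(word))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {tri: rank for rank, (tri, _) in enumerate(ranked)}


def out_of_place_distance(profile_a, profile_b):
    """Assumed metric: sum of rank differences over the trigrams in profile_a.

    A trigram missing from profile_b is charged a fixed maximum penalty.
    """
    penalty = max(len(profile_a), len(profile_b))
    return sum(
        abs(rank - profile_b[tri]) if tri in profile_b else penalty
        for tri, rank in profile_a.items()
    )


print(word_trigrams("tatlo"))  # ['tat', 'atl', 'tlo']
```

Under this assumed metric, a smaller distance between two language profiles indicates more similar trigram rankings. The resulting pairwise distances could then be passed to a clustering step such as the k-means grouping the abstract mentions.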