- Title
AUTOMATIC TAGGING OF PERSIAN WEB PAGES BASED ON N-GRAM LANGUAGE MODELS USING MAPREDUCE.
- Authors
Shahrivari, Saeed; Rahmani, Saeed; Keshavarz, Hooman
- Abstract
Page tagging is one of the most important facilities for improving the accuracy of information retrieval on the web. Tags are short pieces of data, usually one or several words, that briefly describe a page. They provide useful information about a page and can be used to improve the accuracy of searching, document clustering, and result grouping. The most accurate approach to page tagging is to use human experts. However, when the number of pages is large, manual tagging is impractical and an automatic solution is needed. We propose a solution called PerTag, which automatically tags a set of Persian web pages. PerTag is based on n-gram models and uses the tf-idf method together with a set of Persian language rules to select suitable tags for each web page. Because our target is very large sets of web pages, PerTag is built on top of the MapReduce distributed computing framework. In our experiments, we used a set of more than 500 million Persian web pages and extracted tags for each page using a cluster of 40 machines. The experimental results show that PerTag is both fast and accurate.
- Subjects
TAGS (Metadata); INFORMATION retrieval; N-gram models (Computational linguistics); PERSIAN language; DOCUMENT clustering
- Publication
ICTACT Journal on Soft Computing, 2015, Vol 5, Issue 4, p1003
- ISSN
0976-6561
- Publication type
Article
- DOI
10.21917/ijsc.2015.0140
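
- Illustrative sketch
The abstract describes selecting tags by scoring n-grams with tf-idf in a map/reduce style. The following is a minimal single-machine sketch of that general idea, not PerTag itself: the function names (`ngrams`, `map_page`, `reduce_document_frequency`, `top_tags`), whitespace tokenization, the n-gram length limit, and the exact tf-idf formula are all assumptions made for illustration, and the Persian language rules mentioned in the abstract are not modeled.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Yield all word n-grams of length 1..max_n as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def map_page(text, max_n=2):
    """'Map' step (assumed): count n-gram occurrences within one page."""
    # Whitespace tokenization is a simplification, not the paper's tokenizer.
    return Counter(ngrams(text.split(), max_n))


def reduce_document_frequency(page_counts):
    """'Reduce' step (assumed): count how many pages contain each n-gram."""
    df = Counter()
    for counts in page_counts:
        df.update(counts.keys())
    return df


def top_tags(pages, k=3, max_n=2):
    """Return the k highest-scoring n-grams per page by tf-idf."""
    page_counts = [map_page(p, max_n) for p in pages]
    df = reduce_document_frequency(page_counts)
    n_pages = len(pages)
    result = []
    for counts in page_counts:
        total = sum(counts.values())
        # tf = relative frequency in the page; idf = log(N / document frequency).
        scores = {g: (c / total) * math.log(n_pages / df[g])
                  for g, c in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        result.append(ranked[:k])
    return result
```

In this sketch an n-gram that occurs in every page gets an idf of zero and so is never preferred as a tag, which is the usual reason for combining term frequency with inverse document frequency. In a real MapReduce job, the per-page counting and the document-frequency aggregation would run as distributed map and reduce tasks rather than as in-memory loops.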