- Title
The strategy on replicate and similar web collections' detecting and clustering.
- Authors
Gao, Kai
- Abstract
The fraction of web collections consisting of duplicates has been surveyed at a high level. The quality of a web crawler improves if it can assess whether a newly crawled web page duplicates a previously crawled page, so a strategy for detecting and filtering duplicates is important for improving search engine performance. This paper first presents an algorithm based on URL hashing to filter duplicates. Building on the weighting and extraction of Chinese key concepts, it then proposes a method to cluster similar results based on content analysis. Both the experimental results and the application validate the feasibility of the approach, which minimizes overlap while clustering similar results effectively. Some open problems and directions for future work are also presented at the end. © 2009 Wiley Periodicals, Inc. Comput Appl Eng Educ 20: 221-231, 2012
- Subjects
WEB development; WEB design; DOCUMENT clustering; UNIFORM Resource Locators; SEARCH engine programming; SEARCH engine software
- Publication
Computer Applications in Engineering Education, 2012, Vol 20, Issue 2, p221
- ISSN
1061-3773
- Publication type
Article
- DOI
10.1002/cae.20388