Title: The strategy on replicate and similar web collections' detecting and clustering.
Authors: Gao, Kai
Abstract: The fraction of web collections consisting of duplicates has been surveyed at a high level. The quality of a web crawler increases if it can assess whether a newly crawled web page is a duplicate of a previously crawled page, so a strategy for detecting and filtering duplicates is important for improving the performance of a search engine. This paper first presents an algorithm based on URL hashing to filter duplicates. Building on the weighting and extraction of Chinese key concepts, it then proposes a method to cluster similar results based on content analysis. Both the experimental results and the application validate the feasibility of the approach, which can minimize overlap while clustering similar results effectively. Some existing problems and further work are also presented at the end. © 2009 Wiley Periodicals, Inc. Comput Appl Eng Educ 20: 221-231, 2012
Subjects: WEB development; WEB design; DOCUMENT clustering; UNIFORM Resource Locators; SEARCH engine programming; SEARCH engine software
Publication: Computer Applications in Engineering Education, 2012, Vol 20, Issue 2, p221
ISSN: 1061-3773
Publication type: Article
DOI: 10.1002/cae.20388
