Title: Lexicon randomization for near-duplicate detection with I-Match.
Authors: Kołcz, A.; Chowdhury, A.
Abstract: Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional techniques relying on direct inter-document similarity computation are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, while very attractive computationally, can be unstable even to small perturbations of document content, which causes signature fragmentation. We focus on I-Match and present a randomization-based technique for increasing its signature stability, with the proposed method consistently outperforming traditional I-Match by as much as 40–60% in terms of the relative improvement in near-duplicate recall. Importantly, the large gains in detection accuracy are offset by only small increases in computational requirements. We also address the complementary problem of spurious matches, which is particularly important for I-Match when fingerprinting long documents. Our discussion is supported by experiments involving large web-page and email datasets.
Subjects: LEXICON Energy (Company); REPRODUCTION of money, documents, etc.; DATABASE searching; DATA mining; SEARCH engines; KNOWLEDGE management; ONLINE data processing; INFORMATION resources management; DETECTORS
Publication: Journal of Supercomputing, 2008, Vol 45, Issue 3, p255
ISSN: 0920-8542
Publication type: Article
DOI: 10.1007/s11227-007-0171-z
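
Illustration: the abstract describes fingerprinting documents with I-Match and stabilizing the fingerprint through lexicon randomization. The record does not give the algorithm, so the following is only a minimal sketch under stated assumptions: a signature is taken to be a SHA-1 hash over the sorted set of document terms that appear in a lexicon, randomized lexicons are assumed to be random subsets of the original lexicon, and two documents are treated as near-duplicates if they agree on at least one signature. The function names, the whitespace tokenizer, and the `k` and `drop_fraction` parameters are hypothetical and are not taken from the paper.

```python
import hashlib
import random


def imatch_signature(text, lexicon):
    """Hash the sorted set of document terms that also occur in the lexicon."""
    terms = sorted(set(text.lower().split()) & lexicon)
    if not terms:
        return None  # no lexicon terms: the document gets no signature
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, k, drop_fraction, seed=0):
    """Return the original lexicon plus k random subsets of it (assumed scheme)."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)
    keep = len(ordered) - int(len(ordered) * drop_fraction)
    subsets = [frozenset(rng.sample(ordered, keep)) for _ in range(k)]
    return [frozenset(lexicon)] + subsets


def signatures(text, lexicons):
    """One signature per lexicon, in lexicon order."""
    return [imatch_signature(text, lex) for lex in lexicons]


def near_duplicates(doc_a, doc_b, lexicons):
    """Assumed decision rule: a match on any single lexicon counts."""
    pairs = zip(signatures(doc_a, lexicons), signatures(doc_b, lexicons))
    return any(sa is not None and sa == sb for sa, sb in pairs)
```

In this sketch, a document that gains or loses one lexicon term changes its single-lexicon signature, which is the fragmentation the abstract mentions. With several lexicons, the two versions still match whenever at least one lexicon happens to omit the differing term.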
