An efficient method to detect duplicates of Web documents with the use of inverted index

The growth of the Internet poses a challenge for Internet search engines: copies of Web documents increasingly flood search results and make them less relevant to users. A method of "descriptive words" for detecting near-duplicate documents is proposed. It selects N words from the index to form a "signature" of each document and can be applied to any search engine built on an inverted index. The method is compared with the method based on "shingles". The two algorithms achieve practically equal accuracy, but the proposed method is more efficient when an inverted index is already available.
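The signature idea can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a signature is an N-bit vector recording which of the N chosen words occur in a document, and that documents with identical signatures are treated as near-duplicate candidates. The abstract does not say how the words are selected, so the `descriptive_words` list, the sample documents, and all function names here are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def signatures(index, descriptive_words, doc_ids):
    """Build one N-bit signature per document by reading N posting lists.

    Assumption: bit i is set when the document contains descriptive word i.
    Only the posting lists of the N chosen words are read, so the document
    texts themselves are never rescanned.
    """
    sigs = {doc_id: 0 for doc_id in doc_ids}
    for bit, word in enumerate(descriptive_words):
        for doc_id in index.get(word, ()):
            sigs[doc_id] |= 1 << bit
    return sigs


def group_near_duplicates(sigs):
    """Group documents that share an identical, non-empty signature."""
    groups = defaultdict(list)
    for doc_id, sig in sigs.items():
        if sig:  # assumption: an all-zero signature carries no evidence
            groups[sig].append(doc_id)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical toy collection and word list, for illustration only.
docs = {
    "a": "cheap flights to paris book hotel now",
    "b": "cheap flights to paris book your hotel today",
    "c": "inverted index compression for search engines",
}
descriptive_words = ["flights", "hotel", "index", "search"]

index = build_inverted_index(docs)
sigs = signatures(index, descriptive_words, docs)
print(group_near_duplicates(sigs))  # [['a', 'b']]
```

In this sketch the cost of building signatures depends on the lengths of N posting lists rather than on the full text of every document, which is one plausible reading of why the approach benefits from an existing inverted index. A shingle-based method, by contrast, typically hashes overlapping word sequences taken from each document's text.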
Published in
World Wide Web Conference
29 Dec 2002