The growth of the Internet challenges Internet search engines: more and more copies of Web documents flood search results, making them less relevant to users. A method of "descriptive words" for detecting near-duplicate documents is proposed. It selects N words from the index to determine a "signature" of each document and can be applied to any search engine based on an inverted index. It is compared with the method based on "shingles". At practically equal accuracy, the proposed method is more efficient when an inverted index is available.
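
The signature idea can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: the signature is taken to be a bit vector with one bit per descriptive word, a bit is set when the word's posting list contains the document, and documents with identical non-empty signatures are reported as near-duplicate candidates. The function names, the whitespace tokenizer, and the hand-picked word list are hypothetical; how the N words are actually chosen, and any frequency thresholds, are not modeled here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids containing it (toy index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def signatures(index, descriptive_words, doc_ids):
    """Build one bit-vector signature per document.

    Assumption: bit i is set when descriptive word i occurs in the document.
    Only the N posting lists are scanned, never the document texts, which is
    why an existing inverted index makes this step cheap.
    """
    sig = {doc_id: 0 for doc_id in doc_ids}
    for bit, word in enumerate(descriptive_words):
        for doc_id in index.get(word, ()):
            sig[doc_id] |= 1 << bit
    return sig


def near_duplicate_groups(sig):
    """Group documents that share an identical, non-empty signature."""
    groups = defaultdict(list)
    for doc_id, s in sig.items():
        if s != 0:  # an empty signature carries no evidence of similarity
            groups[s].append(doc_id)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

In this sketch the cost of computing all signatures is proportional to the total length of the N posting lists, whereas a shingle-based comparison has to process the full text of every document.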
World Wide Web Conference
29 Dec 2002