Near-duplicates and shingling

Description

We can now generate all pairs $i,j$ for which $x_i^\pi$ is present in both their sketches. From these we can compute, for each pair $i,j$ with non-zero sketch overlap, a count of the number of $x_i^\pi$ values they have in common. By applying a preset threshold, we know which pairs $i,j$ have heavily overlapping sketches. For instance, if the threshold were 80% and each sketch held 200 values, we would need the count to be at least 160 for any $i,j$. As we identify such pairs, we run a union-find algorithm to group documents into near-duplicate "syntactic clusters". This is essentially a variant of the single-link clustering algorithm introduced in Section 17.2.
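
The pair-counting and union-find steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `cluster_near_duplicates`, the dict-of-sets input, and the use of an absolute count as the threshold are hypothetical choices that do not come from the source, and the toy sketches hold 5 values each rather than 200.

```python
from collections import defaultdict
from itertools import combinations


def cluster_near_duplicates(sketches, threshold):
    """Group documents whose sketches share at least `threshold` values.

    sketches: dict mapping a document id to a set of sketch values
              (hypothetical input format, assumed for this example).
    threshold: minimum number of shared sketch values, as an absolute count.
    """
    # Invert the sketches: sketch value -> documents whose sketch contains it.
    postings = defaultdict(list)
    for doc, values in sketches.items():
        for value in values:
            postings[value].append(doc)

    # For every pair with non-zero sketch overlap, count the shared values.
    overlap = defaultdict(int)
    for docs in postings.values():
        for i, j in combinations(sorted(docs), 2):
            overlap[(i, j)] += 1

    # Union-find with path compression (path halving).
    parent = {doc: doc for doc in sketches}

    def find(doc):
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    # Merge the clusters of every pair that meets the threshold.
    for (i, j), count in overlap.items():
        if count >= threshold:
            parent[find(i)] = find(j)

    # Collect documents by the root of their cluster.
    clusters = defaultdict(set)
    for doc in sketches:
        clusters[find(doc)].add(doc)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: 5-value sketches, so an 80% threshold means a count of 4.
toy_sketches = {
    "a": {1, 2, 3, 4, 5},
    "b": {1, 2, 3, 4, 9},
    "c": {2, 3, 4, 9, 10},
    "d": {20, 21, 22, 23, 24},
}
toy_clusters = cluster_near_duplicates(toy_sketches, threshold=4)
print(toy_clusters)
```

In the toy data, documents a and c share only 3 values, yet they end up in the same cluster because each shares 4 values with b. This chaining is the single-link behaviour the paragraph refers to.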


http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
