Finding documents that are almost exact duplicates is difficult. Exact hashing algorithms do not work and pairwise comparisons do not scale. Locality Sensitive Hashing (LSH) is a scalable method for detecting near duplicate content that allows computation to be exchanged for accuracy. I will present the theoretical side of LSH and an open source Python implementation of the technique.
I am looking for editors/curators to help with branches of the tree. Please send me an email if you are interested.