wware/minhash

This is based on some slides that I got from David Weisman about a study he did of similarity between pages on PubMed. I'm not sure I'm doing this right, as it takes quite a bit of time. But each paper is reduced to a vector of relatively few integers (50 or 100 or 1000) and we simply count which components are equal by position, and that count is the similarity.

About

http://en.wikipedia.org/wiki/MinHash

Languages

Language:Python 100.0%