dnbaker / wmh

Weighted Minhash Code

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Contains re-implementations of data structures from https://dl.acm.org/doi/10.1145/3219819.3220089 and https://arxiv.org/abs/1911.00675, as well as Python bindings.

Installation:

git clone --recursive https://github.com/dnbaker/wmh
cd wmh/python
python3 setup.py install

Use:

import wmh
import numpy as np

weights = np.random.rand(1000)
ids = np.random(100000, size=(1000,))
signature_size = 500
sigs = [wmh.hash(weights, ids, m=signature_size, stype=sketcher) for sketcher in
        ("bmh1", "bmh2", "pmh1", "phm1a")]

PMinHash effectively normalizes all counts, while bmh treats the id, w pairs as weighted elements in a weighted set.

About

Weighted Minhash Code


Languages

Language:C++ 75.3%Language:C 15.7%Language:Python 7.5%Language:Makefile 1.5%