vpekar/sklearn-majorclust

clustering python scikit-learn

A scikit-learn API for the MajorClust clustering algorithm. The implementation accepts an optional parameter, sim_threshold: every cosine value in the similarity matrix that falls below it is set to zero. Tuning this parameter often helps to remove spurious similarities and bring out the relevant clusters.
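To make the effect of sim_threshold concrete, here is a minimal, dependency-free sketch of thresholding a cosine similarity matrix. It is not the package's own code: the helper names cosine and thresholded_similarity are hypothetical, and keeping values exactly equal to the threshold is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def thresholded_similarity(vectors, sim_threshold=0.0):
    """Hypothetical helper: pairwise cosine matrix with weak entries zeroed."""
    sims = [[cosine(u, v) for v in vectors] for u in vectors]
    # Assumption: values strictly below the threshold are dropped,
    # values equal to it are kept.
    return [[s if s >= sim_threshold else 0.0 for s in row] for row in sims]


# Two nearly parallel vectors and one that is almost orthogonal to both.
vectors = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.1], [0.1, 0.0, 1.0]]
pruned = thresholded_similarity(vectors, sim_threshold=0.5)

for row in pruned:
    print(["%.2f" % s for s in row])
```

With a threshold of 0.5, the weak links to the third vector are removed while the strong link between the first two survives, which is the kind of pruning the parameter is meant to provide.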

The implementation reuses this gist.
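For orientation, MajorClust is usually described as an iterative relabeling procedure: every node starts in its own cluster and repeatedly adopts the cluster to which it is most strongly connected, until no label changes. The following is a rough sketch under that description, not the code from the gist. The function name majorclust_sketch, the max_iter cap, and the tie-breaking rule are all assumptions made for illustration.

```python
def majorclust_sketch(sim, max_iter=100):
    """Hypothetical helper: label nodes of a weighted similarity graph.

    sim is a square list-of-lists similarity matrix; zero means "no edge".
    """
    n = len(sim)
    labels = list(range(n))  # every node starts in its own cluster
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            # Sum the edge weights from node i to each neighbouring cluster.
            attraction = {}
            for j in range(n):
                if j != i and sim[i][j] > 0:
                    attraction[labels[j]] = attraction.get(labels[j], 0.0) + sim[i][j]
            if not attraction:
                continue  # an isolated node keeps its own label
            # Adopt the most attractive cluster; on ties, prefer the current
            # label (an assumption, chosen to avoid needless flipping).
            best = max(attraction, key=lambda c: (attraction[c], c == labels[i]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Two disconnected pairs of nodes: (0, 1) and (2, 3).
sim = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
labels = majorclust_sketch(sim)
print(labels)
```

In this sketch the cluster ids are simply node indices that survived the relabeling, so they are arbitrary and need not be consecutive.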

Example use:

import majorclust
from sklearn.feature_extraction.text import TfidfVectorizer

# input documents
texts = [
    "foo blub baz",
    "foo bar baz",
    "asdf bsdf csdf",
    "foo bab blub",
    "csdf hddf kjtz",
    "123 456 890",
    "321 890 456 foo",
    "123 890 uiop"]

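# vectorize the documents and cluster them; with sim_threshold=0.0 no
# similarities are zeroed out, since tf-idf cosine values are never negative
mc = majorclust.MajorClust(sim_threshold=0.0)
X = TfidfVectorizer().fit_transform(texts)
mc.fit(X)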

# group the input texts by their assigned cluster label
clusters = {}
for text, label in zip(texts, mc.labels_):
    clusters.setdefault(label, []).append(text)

# print each cluster, ordered by label
for label, members in sorted(clusters.items()):
    print("Cluster id %d:" % label)
    for t in members:
        print(t)
    print("=" * 20)

Output:

Cluster id 1:
foo blub baz
foo bar baz
foo bab blub
====================
Cluster id 4:
asdf bsdf csdf
csdf hddf kjtz
====================
Cluster id 7:
123 456 890
321 890 456 foo
123 890 uiop
====================

BSD 3-Clause "New" or "Revised" License
