AnDa-creator / SKMeans

Implementation of k-means with cosine distance as distance metric

SKMeans

Implementation of k-means with cosine distance as the distance metric. The mean of each cluster is still computed in the same way as in standard k-means. The SKMeans class computes k clusters for an input, based on cosine distances.
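To make the idea concrete, here is a minimal sketch of a single iteration: points are assigned to the centre with the smallest cosine distance, and each centre is then replaced by the plain arithmetic mean of its members. This is not the code from skmeans.py. The function name cosine_kmeans_step, the dense-array input, the handling of empty clusters and the absence of all-zero rows are all assumptions made for illustration.

```python
import numpy as np


def cosine_kmeans_step(X, centres):
    """One assignment + update step of k-means under cosine distance (illustrative)."""
    # Scale rows to unit length so that a dot product equals cosine similarity.
    # Assumes there are no all-zero rows in X or centres.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    C_unit = centres / np.linalg.norm(centres, axis=1, keepdims=True)

    # Cosine distance = 1 - cosine similarity; shape (n_samples, n_clusters).
    distances = 1.0 - X_unit @ C_unit.T
    labels = distances.argmin(axis=1)

    # Update step: arithmetic mean of the member rows, as in standard k-means.
    new_centres = centres.copy()
    for k in range(centres.shape[0]):
        members = X[labels == k]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centre
            new_centres[k] = members.mean(axis=0)
    return labels, new_centres


rng = np.random.default_rng(0)
X = rng.random((150, 50))
centres = X[rng.choice(150, size=5, replace=False)]
labels, centres = cosine_kmeans_step(X, centres)
```

Repeating this step for a fixed number of iterations gives the overall loop.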

Requirements

Requires scipy and numpy. To install them, follow the instructions on this page.

Usage

The source file skmeans.py contains a single class, SKMeans. To instantiate it,

>>> from skmeans import SKMeans
>>> no_clusters = 300
>>> kmeans_inst = SKMeans(no_clusters,iters=15)

Here, no_clusters is the number of clusters to be generated and iters is the number of iterations for which k-means will be run. The default value of iters is 300.

To run k-means on an input matrix X,

>>> import numpy
>>> X = numpy.random.rand(150, 50)
>>> kmeans_inst.fit(X)

X can be a sparse matrix or a numpy array. When no keyword arguments are provided, as here, the centres are sampled randomly from X. To provide your own centre values,

>>> centres = numpy.random.rand(10, 50)
>>> kmeans_inst.fit(X,sample=False,param_centres=centres)

Here, centres can be a sparse matrix or a 2-D numpy array. The default value of param_centres is None. If sample is set to False, a matrix must be provided for param_centres.
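Instead of random values, a centre matrix can also be built by hand from rows of the input. The sketch below picks distinct rows of X as starting centres. Whether SKMeans itself samples with or without replacement is not stated, so replace=False is an assumption, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((150, 50))
no_clusters = 10

# Pick no_clusters distinct row indices, then use those rows as starting centres.
row_ids = rng.choice(X.shape[0], size=no_clusters, replace=False)
centres = X[row_ids]
```

The resulting matrix has one row per centre and the same number of columns as X, so it can be passed as param_centres together with sample=False.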

Two-pass k-means is also supported. In the first pass, a small subset of the input matrix is used to sample centres, and k-means is run over this subset with those centres. The centres found in the first pass are then passed to the second pass, which runs k-means over the complete input matrix. To use two-pass k-means, call the fit method with the following parameter,

>>> kmeans_inst.fit(X,two_pass=True)

The default value of the two_pass flag is False. When set to True, it takes precedence over the sample flag, whose value is ignored. The number of input samples for the first pass of k-means is determined by the following expression,

>>> no_samples = max(2*np.sqrt(X.shape[0]), 10*self.no_clusters)
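As a worked example of this expression, the sketch below evaluates it for two hypothetical problem sizes. The helper name first_pass_sample_size is illustrative and not part of skmeans.py. The result can be a float, and how SKMeans rounds or caps it is not stated here.

```python
import numpy as np


def first_pass_sample_size(n_rows, no_clusters):
    # Same expression as above, with X.shape[0] and self.no_clusters
    # passed in as plain arguments.
    return max(2 * np.sqrt(n_rows), 10 * no_clusters)


# Hypothetical sizes, chosen only for illustration.
small = first_pass_sample_size(n_rows=1_000_000, no_clusters=50)
large = first_pass_sample_size(n_rows=1_000_000, no_clusters=300)
print(small, large)
```

With one million rows the square-root term is 2000. For 50 clusters the cluster term is only 500, so the square-root term decides the sample size. For 300 clusters the cluster term is 3000 and takes over.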

Distance Calculation

SKMeans uses a matrix multiplication to calculate the cosine distances. Hence, it is fairly fast in practice compared to methods that use scipy.spatial.distance.cdist to compute the cosine distances.
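The matrix-multiplication approach can be sketched as follows: once every row is scaled to unit length, a single matrix product yields all pairwise cosine similarities. This is an illustration for dense arrays rather than the code in skmeans.py. The function name cosine_distances_matmul is an assumption, and rows are assumed to be non-zero.

```python
import numpy as np
from scipy.spatial.distance import cdist


def cosine_distances_matmul(X, centres):
    # Normalise rows to unit length (assumes no all-zero rows).
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    C_unit = centres / np.linalg.norm(centres, axis=1, keepdims=True)
    # One matrix product gives all pairwise cosine similarities;
    # subtracting from 1 turns them into cosine distances.
    return 1.0 - X_unit @ C_unit.T


rng = np.random.default_rng(0)
X = rng.random((150, 50))
centres = rng.random((10, 50))

d_matmul = cosine_distances_matmul(X, centres)
d_cdist = cdist(X, centres, metric="cosine")

# Both routes should agree up to floating-point error.
same = np.allclose(d_matmul, d_cdist)
```

Both results are (150, 10) matrices of distances, so either can be used to assign each row of X to its nearest centre.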

Note

This work is free. You can redistribute it and/or modify it under the terms of the Do Whatever You Want To Public License.
