scalanlp / nak

The Nak Machine Learning Library

Improve k-means code. #helpwanted

jasonbaldridge opened this issue · comments

The current k-means implementation is something I did for homework assignments for teaching NLP courses at UT Austin. It can handle a fair amount, but it runs out of steam (in particular, memory) for larger datasets, especially if they have a lot of features. It currently uses dense vectors to represent the features for each data point, so it should be a fairly straightforward win to change this to use sparse vectors instead.
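
To make the memory argument concrete, here is a minimal sketch of the dense-versus-sparse trade-off. It is written in plain Python for illustration only; the names `to_sparse` and `sq_dist_sparse` are hypothetical and do not come from nak or Breeze. It assumes a sparse vector can be modelled as a dictionary from feature index to non-zero value.

```python
# Hypothetical illustration; none of these names come from nak or Breeze.

def to_sparse(dense):
    """Keep only the non-zero entries as an {index: value} dict."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def sq_dist_sparse(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    total = 0.0
    # Indices present in a (b contributes 0.0 where it has no entry).
    for i, v in a.items():
        d = v - b.get(i, 0.0)
        total += d * d
    # Indices present only in b.
    for i, v in b.items():
        if i not in a:
            total += v * v
    return total

dim = 100_000
point = [0.0] * dim              # dense: stores all `dim` entries
point[3], point[70_000] = 1.0, 2.0
sparse_point = to_sparse(point)  # sparse: stores only the 2 non-zero entries
```

With many features and few non-zero values per data point, the sparse form stores a small fraction of the entries, and the distance computation only touches indices that are actually present.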

As is my (bad) habit, the K-means(++) impl in Breeze is generic on vector
type, so it can use SparseVectors.

-- David
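
The idea of a k-means step that is generic on vector type can be sketched roughly as follows. This is not Breeze's actual API: Breeze would express this with Scala type parameters, while this plain-Python stand-in simply passes the distance function in. All names here (`assign`, `dense_sq_dist`, `sparse_sq_dist`) are hypothetical.

```python
# Hypothetical sketch: the assignment step only needs a distance function,
# so the same code serves dense lists and sparse {index: value} dicts.

def assign(points, centroids, dist):
    """Return, for each point, the index of its nearest centroid under dist."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]

def dense_sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sparse_sq_dist(a, b):
    """Squared Euclidean distance between two {index: value} dicts."""
    keys = a.keys() | b.keys()
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys)

# The same two points and centroids in both representations.
dense_points = [[0.0, 0.0, 1.0], [5.0, 0.0, 0.0]]
dense_centroids = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
sparse_points = [{2: 1.0}, {0: 5.0}]
sparse_centroids = [{}, {0: 4.0}]

dense_labels = assign(dense_points, dense_centroids, dense_sq_dist)
sparse_labels = assign(sparse_points, sparse_centroids, sparse_sq_dist)
```

Both calls produce the same cluster assignments; only the storage of the vectors differs.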

On Tue, Apr 16, 2013 at 12:45 PM, Jason Baldridge
notifications@github.com wrote:

The current k-means implementation is something I did for homework
assignments for teaching NLP courses at UT Austin. It can handle a fair
amount, but it runs out of steam (in particular, memory) for larger
datasets, especially if they have a lot of features. It currently uses
dense vectors to represent the features for each data point, so it should
be a fairly straightforward win to change this to use sparse vectors
instead.



Awesome. This may be sorted out directly as we transition things from Breeze then.