ccmien / sofia-ml

Automatically exported from code.google.com/p/sofia-ml

sofia-kmeans diverging with increasing number of iterations?

GoogleCodeExporter opened this issue

What steps will reproduce the problem?
1. Create 2-dimensional data drawn from 2-dimensional multivariate Gaussian 
distributions with different means and variance = 1, e.g. 21 different 
distributions with 1000 draws each, for a total of 21,000 points. (I have tried 
many different variations, and none of them had any positive effect on the 
reported issue.)

2. Train sofia-kmeans with any batch size (tested 500:500:5000) and with any 
number of clusters k (tested 64, 128, 256) using mini_batch_kmeans with a fixed 
random seed.

command line: sofia-kmeans --k 64 --dimensionality 3 --random_seed 124 
--init_type random --opt_type mini_batch_kmeans --mini_batch_size 500 
--iterations 10 --objective_after_init --objective_after_training 
--training_file traindatafile.svmlight --model_out modelfile.sofia

3. Calculate the training error
command line: sofia-kmeans --model_in modelfile.sofia --test_file 
traindatafile.svmlight --objective_on_test --cluster_assignments_out 
trainingassignments.sofia

4. Run steps 2 and 3 in a loop as a function of the number of iterations. I ran 
[1 10 100e3 500e3 1000e3].
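
For anyone reproducing this, the steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the script used for the report: the helper names (`make_gaussian_blobs`, `to_svmlight_lines`, `train_command`, `test_command`), the range the means are drawn from, and the dummy label 0 are all assumptions. The sofia-kmeans flags are copied from the command lines in steps 2 and 3.

```python
import random

# Iteration counts from step 4: [1 10 100e3 500e3 1000e3]
ITERATIONS = (1, 10, 100000, 500000, 1000000)


def make_gaussian_blobs(n_clusters=21, draws_per_cluster=1000, seed=0):
    """Draw 2-D points from Gaussians with different means and variance 1."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_clusters):
        # Assumption: means are spread uniformly over a square; the report
        # does not say how the means were chosen.
        mean_x, mean_y = rng.uniform(-20, 20), rng.uniform(-20, 20)
        for _ in range(draws_per_cluster):
            points.append((rng.gauss(mean_x, 1.0), rng.gauss(mean_y, 1.0)))
    rng.shuffle(points)
    return points


def to_svmlight_lines(points, label=0):
    """Format points as '<label> 1:<x> 2:<y>' lines (dummy label assumed)."""
    return ["%d 1:%.6f 2:%.6f" % (label, x, y) for x, y in points]


def train_command(iterations, k=64, mini_batch_size=500, seed=124):
    """Training command from step 2, with the iteration count varied."""
    return [
        "sofia-kmeans", "--k", str(k), "--dimensionality", "3",
        "--random_seed", str(seed), "--init_type", "random",
        "--opt_type", "mini_batch_kmeans",
        "--mini_batch_size", str(mini_batch_size),
        "--iterations", str(iterations),
        "--objective_after_init", "--objective_after_training",
        "--training_file", "traindatafile.svmlight",
        "--model_out", "modelfile.sofia",
    ]


def test_command():
    """Evaluation command from step 3."""
    return [
        "sofia-kmeans", "--model_in", "modelfile.sofia",
        "--test_file", "traindatafile.svmlight", "--objective_on_test",
        "--cluster_assignments_out", "trainingassignments.sofia",
    ]


# Print the command pairs for the loop in step 4 (nothing is executed here).
for iterations in ITERATIONS:
    print(" ".join(train_command(iterations)))
    print(" ".join(test_command()))
```

Writing the lines returned by `to_svmlight_lines(make_gaussian_blobs())` to traindatafile.svmlight and passing each printed command to `subprocess.run` would complete the loop, assuming sofia-kmeans is on the PATH.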

What is the expected output? What do you see instead?
I expect the training error to fall as the number of iterations increases. 
Since the seed is fixed, the random initialization is the same in every run. 
The error does fall up to 100e3 iterations, but beyond that it starts to 
diverge, i.e. the training error starts increasing dramatically. The training 
error even becomes larger than that of the random initialization. This is very 
puzzling to me.
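
To cross-check the reported numbers independently of sofia-kmeans, the objective can be recomputed from the points and the cluster centers. The sketch below assumes that "training error" means the usual k-means objective, the squared Euclidean distance from each point to its nearest center, summed over all points. Whether sofia-kmeans reports a sum or a mean is not stated in this report, and the function names are hypothetical.

```python
def squared_distance(point, center):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(point, center))


def kmeans_objective(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(
        min(squared_distance(p, c) for c in centers)
        for p in points
    )


# Toy check: centers near the data give a much smaller objective
# than centers far away from it.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
good_centers = [(0.5, 0.0), (10.0, 0.0)]
bad_centers = [(100.0, 100.0), (-100.0, -100.0)]

print(kmeans_objective(points, good_centers))  # 0.5
print(kmeans_objective(points, bad_centers))
```

If a model trained for 1000e3 iterations scores worse under this function than the random initialization does, the divergence is in the learned centers themselves and not only in how the objective is reported.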

What version of the product are you using? On what operating system?
svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml 
sofia-ml-read-only
Checkout performed on 10/3-2015
OS: Ubuntu 14.04

Please provide any additional information below.
Attached are the commands and output from sofia-kmeans (sofia_kmeans.txt). In 
addition, all model, assignment, and data files needed to reproduce these 
findings are provided (tmp.zip).


Original issue reported on code.google.com by hr.j...@hotmail.com on 11 Mar 2015 at 12:36

Attachments: