HongJea-Park / robust_EM_for_gmm

MS Yang, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognit., 45 (2012), pp. 3950-3961

Problem with a duplicated data set when updating the hidden variable

xiaojin-hu opened this issue · comments

The data I used was collected from real measurements, so it contains many identical values. The data in the provided code examples, however, are generated by sampling, so all data points are distinct. My initialization for this case is: first use np.unique to remove duplicate data points and use the remaining unique points as the initial means; the initial number of clusters is the number of these means; the mixing coefficients are initialized as the frequency of each unique point divided by the total number of data points. While the program runs, a problem appears when updating the hidden variable z: min(self.z_.sum(axis=1)) = 0, i.e., some data points in the data set do not belong to any of the Gaussian components.
I look forward to your assistance in solving this problem. Thank you!

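For illustration, here is a minimal sketch of the initialization described above (this is not the repository's code; init_from_duplicated_data is a hypothetical helper name):

import numpy as np

def init_from_duplicated_data(X):
    '''Initialize means, mixing proportions, and the number of components
    from a data set that may contain duplicated points.'''
    # Unique rows of X and how often each one occurs.
    means, counts = np.unique(X, axis=0, return_counts=True)
    c = means.shape[0]            # initial number of components
    pi = counts / X.shape[0]      # frequency of each unique point / total number of points
    return means, pi, c

# Example with duplicated points:
X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
means, pi, c = init_from_duplicated_data(X)
print(c)    # 3 unique points -> 3 initial components
print(pi)   # [0.5  0.25 0.25]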

Hi xiaojin-hu.

Because your data contains duplicated points, you removed the duplicates before initializing the robust EM algorithm. In my opinion, the problem you encountered when fitting the model probably comes from the step that updates the number of components.

In robust_EM_for_gmm/robustEM/rEM.py, the number of components is updated by the function robustEM.update_c.

robust_EM_for_gmm/robustEM/rEM.py: robustEM.update_c

def update_c(self):

        '''
            Update the number of components.
            This function refers to equations (14), (15), and (16) in the paper.
        '''

        # Keep only the components whose mixing proportion is large enough
        # to generate at least one data point.
        idx_bool = self.pi_ >= 1 / self.n_
        new_c = idx_bool.sum()

        # Renormalize the mixing proportions of the surviving components.
        pi = self.pi_[idx_bool]
        self.pi_ = pi / pi.sum()

        # Renormalize the responsibilities over the surviving components.
        z = self.z_[:, idx_bool]
        self.z_ = z / z.sum(axis=1).reshape(-1, 1)

        self.means_ = self.means_[idx_bool, :]

        return new_c

Let self.origin_n_ be the number of original data points and self.remain_n_ be the number of remaining (unique) data points. When I implemented this paper, I assumed that all data points are different from each other, so self.n_ in update_c is equal to self.remain_n_.

In this function, idx_bool = self.pi_ >= 1 / self.n_ means that a component whose mixing proportion is less than 1 / n cannot generate even a single data point, so such components should be removed.

But when duplicated data points exist, you should change idx_bool = self.pi_ >= 1 / self.n_ to idx_bool = self.pi_ >= 1 / self.origin_n_.

I suspect you did not modify this function.
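As a small numerical illustration of why the threshold matters (the numbers below are made up, not from this issue): with 10,000 original points that collapse to 500 unique values, a component with mixing proportion 0.001 would be pruned under 1 / remain_n_ but kept under 1 / origin_n_.

origin_n = 10000    # original data points, duplicates included
remain_n = 500      # unique data points left after np.unique
pi_k = 0.001        # mixing proportion of some component k

print(pi_k >= 1 / remain_n)   # False -> component would be removed (too strict)
print(pi_k >= 1 / origin_n)   # True  -> component is kept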

I have uploaded new code that handles duplicated data sets.

Pull the new version and try it again on your data set.

And let me know the results.

Thank you very much for helping me solve the problem so quickly. I fitted the first 2000 data points (data[1:2000]) with both the new version of the code you provided and the version I changed, and compared the results. Although there is no error when updating the hidden variable z, the GMM fit from your version is not as good as that from my changed version (I have provided plots for comparison); however, as the amount of data grows, my changed version again fails when updating the hidden variable (this happens when I fit all the data, data[1:-1]).
The idea behind my change is: after each iteration update, merge the repeated Gaussians; everything else is the same as your code (a rough sketch of this merging step is shown below).
I have sent the fitted data and my changed rEM.py to your email. I hope to get your help again!
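A rough sketch of this merging step (merge_duplicate_components is a hypothetical helper name, not the emailed rEM.py; for brevity it merges only means and mixing proportions):

import numpy as np

def merge_duplicate_components(means, pi, tol=1e-8):
    '''Merge components whose means coincide (within a tolerance),
    summing their mixing proportions.'''
    keep_means, keep_pi = [], []
    for mu, p in zip(means, pi):
        for i, kept in enumerate(keep_means):
            if np.linalg.norm(mu - kept) < tol:   # duplicate of an earlier component
                keep_pi[i] += p
                break
        else:
            keep_means.append(mu)
            keep_pi.append(p)
    return np.array(keep_means), np.array(keep_pi)

# Example: two components collapsed onto the same mean after an update.
means = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 1.0]])
pi = np.array([0.2, 0.3, 0.5])
m, p = merge_duplicate_components(means, pi)
print(m)   # [[0. 0.] [3. 1.]]
print(p)   # [0.5 0.5]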

Thank you for showing me a new approach!

The difference between my idea and yours is that in my approach the step that merges (removes) duplicated Gaussians is done before iteration, whereas in yours this step is done during iteration.

I tested with the data you sent me by email, and I think your approach is better than mine.

As you commented, the problem is in updating the hidden variable z_.
In the removal step, some Gaussians are removed and the problem arises:
some data points have large latent-variable values for the removed Gaussians and latent-variable values very close to zero for the remaining Gaussians.
These data points cause the problem when updating the covariance matrices and the number of components.
To handle this, I add a smoothing parameter to the latent variable z_ in self.predict_proba().
I tested with your data and there are no errors.
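A minimal sketch of this smoothing idea (the eps value and the helper below are assumptions, not the repository's exact code): add a tiny constant to every responsibility before normalizing, so that a row of z_ can never sum to exactly zero after components are removed.

import numpy as np

def smoothed_responsibilities(unnormalized_z, eps=1e-10):
    '''Normalize responsibilities row-wise with additive smoothing.'''
    z = unnormalized_z + eps                 # every entry strictly positive
    return z / z.sum(axis=1, keepdims=True)

# Example: a point whose mass sat entirely on a removed component.
unnorm = np.array([[0.0, 0.0, 0.0],          # would give 0 / 0 without smoothing
                   [0.7, 0.2, 0.1]])
print(smoothed_responsibilities(unnorm))
# Row 0 becomes uniform [1/3, 1/3, 1/3]; row 1 is essentially unchanged.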

I modified your code to be more pythonic and updated it in my repository.
Check it out.

Thank you for helping me again!
The problem with updating the hidden variable z_ has been solved!
Wishing you all the best!

Thank you for checking my comment.
Now that the problem is solved, I will close this issue.