the problem with duplicated data set when updating hidden variable #1
Comments
Hi xiaojin-hu. Because your data contain duplicated points, you removed the duplicates to initialize the robust EM algorithm. In my opinion, the problem you encountered when fitting the model is probably caused by the step that updates the number of components. In
Let In the function, But when duplicated data points exist, you should modify I think you did not modify this function. I uploaded new code that takes the duplicated data into account. Please pull the new version, try it again on your data set, and let me know the results.
Thank you very much for helping me with the problem so promptly. I fitted the first 2000 data points (data[1:2000]) with both the new version of the code you provided and the version I changed, and compared the results. The version you provided no longer raises an error when updating the hidden variable Z, but its GMM fit is not as good as that of the version I changed (I provide the fit plots for comparison). However, with the code I changed, the hidden variable is updated incorrectly as the amount of data increases (for example, when I fit all the data, data[1:-1]).
Thank you for giving me a new approach! The difference between my idea and yours is that in my approach the step that merges (removes) duplicated Gaussians is performed before the iterations, whereas in yours it is performed during the iterations. I tested it with the data you sent me by email, and I think your approach is better than mine. As you commented, the problem is in the update of the hidden variable. I modified your code to be more Pythonic and pushed it to my repository.
Thank you for helping me again!
Thank you for checking my comment.
The data I used were collected in practice and contain many identical values, whereas the data in the provided code examples are all generated by sampling, so every data point is distinct. My initialization for this case is as follows: first I use np.unique to remove the duplicate values from the data, and the remaining sample points are used to initialize the means. The number of clusters is initialized to the number of these means, and each mixing coefficient is initialized to the frequency of the corresponding data point divided by the total number of data points. While the program is running, a problem occurs when updating the hidden variable z: min(self.z_.sum(axis=1)) = 0; that is, some data points in the data set do not belong to any of the Gaussian components.
I look forward to your help in solving this problem. Thank you, and my respects.
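For readers following along, the initialization described above might look roughly like the sketch below. It assumes one-dimensional data held in a NumPy array; the function names `init_from_unique` and `rows_without_component` are hypothetical and are not taken from the repository. The second helper only reproduces the check `min(self.z_.sum(axis=1)) = 0` on a generic responsibility matrix `z`, it does not fix the update itself.

```python
import numpy as np


def init_from_unique(data):
    """Initialise a 1-D mixture from the unique values of `data`.

    Hypothetical helper: means are the unique values, the component
    count is the number of unique values, and each mixing coefficient
    is the frequency of that value divided by the number of points.
    """
    data = np.asarray(data, dtype=float)
    # return_counts=True also yields how often each unique value occurs
    means, counts = np.unique(data, return_counts=True)
    n_components = means.shape[0]
    weights = counts / data.shape[0]
    return means, n_components, weights


def rows_without_component(z):
    """Return indices of data points whose responsibilities sum to zero."""
    z = np.asarray(z, dtype=float)
    return np.flatnonzero(z.sum(axis=1) == 0)


# Toy data with repeated values, standing in for the collected data set
data = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
means, n_components, weights = init_from_unique(data)

# Toy responsibility matrix in which the second point has no component
z_example = np.array([[0.5, 0.5], [0.0, 0.0], [1.0, 0.0]])
orphans = rows_without_component(z_example)
```

Listing the offending row indices, rather than only checking the minimum, makes it easier to see which data points end up without any Gaussian component.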