WCSS scores #17
Comments
Hi @JaneChik, thanks for reporting this. We have never seen this behavior, and it looks like a bug, although I can't think of what could be causing it. I tried to reproduce it, but couldn't with the following reproducible example:
Do you mind sharing with me the
I'm sure this is not the issue, but
Finally, can you please confirm which version of R, Bioconductor, and mbkmeans you are using?
Best,
Thank you, Davide!
1. Dataset
Here is the example dataset I was using (pca_df.rds). Also, sorry, I described it incorrectly: it is a matrix, with
I think it is probably something to do with the data, but I can't figure out what it could be.
2. Random data
I just tried randomly generated data. I guess it is a rather strange example, but with such a duplicated dataset the same plots are observed:
However, I haven't observed any duplicates in my data.
3. Versions used
This could possibly be the problem
I can't use another R version there. So I also tried on another machine with
I guess I can try installing the development version from GitHub...
Hi @JaneChik, thanks for sending the data and for further exploring this with randomly generated data. I was able to reproduce your results.
It does indeed look related to repeated data points (at least for the randomly generated data): when you have a cluster made of exactly the same observation repeated multiple times, the WCSS_per_cluster is sometimes (correctly) 0 and sometimes a very large number. My guess is that this is due to numerical errors in R, but I will have to explore this further.
In the real data, it looks like the problems occur when there are clusters of only one observation. I will check the code and report back.
As a side note, the data in pca_df don't look very good. Typically, in single-cell RNA-seq you want to perform PCA on log-normalized counts (and perhaps use
Best,
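One way to check whether a reported per-cluster WCSS is plausible is to recompute it directly from the data and the cluster labels. The sketch below is only an illustration in base R: `wcss_per_cluster` is a hypothetical helper, not part of mbkmeans, and it assumes observations are in rows (transpose the matrix first if they are stored in columns). For a cluster of identical observations, or a cluster with a single observation, the recomputed value should be exactly 0.

```r
# Hypothetical helper (not part of mbkmeans): recompute the within-cluster
# sum of squares from the data and a vector of cluster labels.
# Assumption: rows of x are observations, columns are features (e.g. PCs).
wcss_per_cluster <- function(x, labels) {
  vapply(split(seq_len(nrow(x)), labels), function(idx) {
    pts <- x[idx, , drop = FALSE]
    centered <- sweep(pts, 2, colMeans(pts))  # subtract the cluster centroid
    sum(centered^2)
  }, numeric(1))
}

# Toy data: cluster "a" is one observation repeated five times,
# cluster "b" is a single observation, cluster "c" has two distinct points.
x <- rbind(
  matrix(rep(c(1, 2), each = 5), nrow = 5),
  c(10, 10),
  c(0, 0),
  c(2, 0)
)
labels <- c(rep("a", 5), "b", "c", "c")

wcss <- wcss_per_cluster(x, labels)
print(wcss)
```

Here clusters "a" and "b" give 0 and cluster "c" gives 2, so a very large value reported for a duplicated or single-observation cluster would point to a problem in the WCSS computation rather than in the data.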
Hi! I am using your very nice mbkmeans function in order to cluster single-cell data based on principal components.
Problem
I have noticed that WCSS scores differ dramatically across runs on the same data (for example, from 3000 to 1400000). The problem is that with such variability the WCSS value does not decrease as the number of clusters increases.
Parameters of mbkmeans: num_init = 100, calc_wcss = TRUE
A dataset with 206 cells was used here, but I saw the same results for other datasets too.
What I have tried:
num_init: such unexpected results were observed less frequently, but still occurred
batch_size = 0.8*dataset_size: as 206<500 the whole dataset was used, so I tried decreasing the batch size a little; this had some positive effect, but there were still unexpectedly high values
Code for the next plots:
pca_df - data frame with PCs for every cell
Examples (wcss plots)
Here is what the plot of WCSS values looks like on average (quite good):

Here are examples of strange behavior:
For testing, the same clustering was performed with stats::kmeans (also with 100 initializations), and there was no problem.
As 206<500 and the whole dataset was then used for clustering, I thought that the results of mbkmeans and kmeans should be somewhat similar.
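A baseline of this kind can be sketched with stats::kmeans alone. The example below is a minimal illustration, not the code used for the plots in this issue: the data are simulated stand-ins for the PC matrix (206 observations in rows, 10 columns assumed), and `iter.max = 50` is an arbitrary choice to reduce convergence warnings.

```r
set.seed(1)
# Simulated stand-in for the PC matrix: 206 observations (rows) x 10 PCs.
# Assumption: the real data would be substituted here.
x <- matrix(rnorm(206 * 10), nrow = 206)

ks <- c(1:10, 20)

# Total within-cluster sum of squares for each k, using 100 random starts.
ref_wcss <- vapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 100, iter.max = 50)$tot.withinss
}, numeric(1))

print(data.frame(k = ks, wcss = ref_wcss))
```

For k = 1 the value equals the total sum of squares around the overall mean, which gives an upper bound to compare against: a WCSS reported for a larger k that exceeds it cannot come from centroids that are the cluster means.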
Examples (repeated clustering)
Repeated clustering with k <- c(1:10, 20):
mbkmeans
For example, in 7 out of 50 runs the last WCSS score (for k=20) was the greatest value of all.
stats::kmeans:
Can you please suggest what could be the reason for such unexpectedly high WCSS scores, and whether it is possible to fix this somehow?