Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WCSS scores #17

Open
JaneChik opened this issue Feb 4, 2022 · 3 comments
Open

WCSS scores #17

JaneChik opened this issue Feb 4, 2022 · 3 comments

Comments

@JaneChik
Copy link

JaneChik commented Feb 4, 2022

Hi! I am using your very nice mbkmeans function in order to cluster single-cell data based on principal components.

Problem

I have noticed that WCSS scores differ across runs on the same data dramatically (for example from 3000 to 1400000). The problem is that with such variability WCSS value is not decreasing as the number of clusters increases.

Parameters of mbkmeans:

num_init = 100, calc_wcss = TRUE
Here was used dataset with 206 cells, but same results I saw for other datasets too.

What I have tried:

  • increased the number of num_init: such unexpected results were observed less frequently, but still existed
  • set batch_size = 0.8*dataset_size: as 206<500 the whole dataset was used, so I tried to decrease the batch size a little and it have some good impact, but still there are unexpected high values

Code for the next plots:

pca_df - data frame with PCs for every cell

wcss_score <- function(k, df){
  km <- mbkmeans(df, clusters = k, num_init = 100, calc_wcss = TRUE)
  wcss <- sum(km[["WCSS_per_cluster"]])
  wcss
}
k <- c(1:20)
total_wcss <- sapply(k, wcss_score, pca_df)
plot(k, type='b', total_wcss, xlab='Number of clusters', ylab='Total WCSS', frame=FALSE)

Examples (wcss plots)

Here is how the plot with wcss values looks like on average (quite good):
alright

Here are examples of strange behavior:

  • for the most of the strange runs it looks like this
    bad_typical
  • but also wcss plot can look like this:
    bad1
    bad2
    bad3

For testing the same clustering was performed with stats::kmeans (also with 100 initializations) and there is no problem.
kmeans
As 206<500 and the whole dataset was then used for clustering, I thought that results of mbkmeans and kmeans should be similar somehow.

Examples (repeated clustering)

Repeated custering with k <- c(1:10, 20) :

  1. mbkmeans
> bad <- 0
> for (i in 1:50) {
+   k <- c(1:10, 20)
+   total_wcss <- sapply(k, wcss_score, pca_df)
+   if (max(total_wcss) == total_wcss[length(total_wcss)]) {
+     bad <- bad + 1
+   }
+ }
> bad
[1] 7

For example, in 7 out of 50 runs last WCSS score (for k=20) was the greatest value of all.

  1. stats::kmeans:
> bad <- 0
> for (i in 1:100) {
+   k <- c(1:10, 20)
+   total_wcss <- sapply(k, function(kk) {
+     kmeans(t(pca_df), centers = kk, nstart = 100)$tot.withinss
+   })
+   if (max(total_wcss) == total_wcss[length(total_wcss)]) {
+     bad <- bad + 1
+   }
+ }
> bad
[1] 0

Can you please suggest what could be the reason of such unexpected high values for WCSS scores and is it possible to fix somehow?

@drisso
Copy link
Owner

drisso commented Feb 7, 2022

Hi @JaneChik

thanks for reporting this. We never saw this behavior and it looks like a bug... although I can't think of what it could be the cause of it.

I have tried to reproduce the bug, but with the following reproducible example I coudln't:

library(mbkmeans)

p <- 10
n <- 200

y <- matrix(rnorm(n*p), nrow=p, ncol=n)

wcss_score <- function(k, df){
  km <- mbkmeans(df, clusters = k, num_init = 100, calc_wcss = TRUE)
  wcss <- sum(km[["WCSS_per_cluster"]])
  wcss
}

k <- c(1:20)
total_wcss <- sapply(k, wcss_score, y)
plot(k, type='b', total_wcss, xlab='Number of clusters', ylab='Total WCSS', frame=FALSE)

bad <- 0
for (i in 1:50) {
   k <- c(1:10, 20)
   total_wcss <- sapply(k, wcss_score, y)
   if (which.max(total_wcss) != 1) {
     bad <- bad + 1
   }
}
bad

Do you mind sharing with me the df object that you are using to generate the plots above to see if I can reproduce the issue? Alternatively (if you don't want to or can't share the data), does the same behavior happen for some randomly generated data?

I'm sure this is not the issue, but mbkmeans does not allow data frames as input and wants a matrix that has genes (or in your case PCs) as rows and cells as columns.

Finally, can you please confirm which version of R, Bioconductor, and mbkmeans you are using?

Best,
Davide

@JaneChik
Copy link
Author

JaneChik commented Feb 7, 2022

Thank you, Davide!

1. Dataset

Here is the example dataset I was using (pca_df.rds).
pca_df.zip

Also sorry, I wrote not correctly, it is a matrix, with data.frame clustering indeed doesn't work at all:

> class(pca_df)
[1] "matrix" "array" 

I think, that it is probably something with the data, but I can't figure out what it could be.

2. Random data

Just tried randomly generated data. I guess, it is rather strange example, but with such duplicated dataset the same plots are observed:
y <- matrix( rep(rnorm(200), 10), nrow = 10, ncol = 200 )

Although, I haven't observed any duplicates in my data.

3. Versions used

Probably it can be the problem

  • R=4.0.5
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
...
  • Bioconductor=3.12
> packageVersion("BiocVersion")[,1:2]
[1] ‘3.12’
  • mbkmeans=1.6.1
> packageVersion("mbkmeans")
[1] ‘1.6.1’

I can't use another R version there.

So, I also tried on another machine with R=4.1.2, Bioconductor=3.13.1 and mbkmeans=1.8.0 and got the same results. But I don't know if it is the case for the latest version.

I guess, I can try development version installing from github...

@drisso
Copy link
Owner

drisso commented Feb 8, 2022

Hi @JaneChik

Thanks for sending the data and for further exploring this with randomly generated data. I was able to reproduce your results.

It looks indeed related to repeated data points (at least for the randomly generated data): when you have a cluster made of exactly the same observation repeated multiple times, the WCSS_per_cluster is sometimes (correctly) 0 and sometimes a very large number. My guess is that this is due to numeric errors in R, but I will have to further explore this.

In the real data, it looks like the problems are when there are clusters of only one observation. I will check the code and report back.

As a side note, the data in pca_df don't look very good. Typically, in single-cell RNA-seq you want to perform PCA on log-normalized counts (and perhaps use scale=TRUE in the PCA function), are you doing that here?

Best,
Davide

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants