WCSS scores #17

JaneChik · 2022-02-04T21:31:08Z

Hi! I am using your very nice mbkmeans function in order to cluster single-cell data based on principal components.

Problem

I have noticed that WCSS scores differ across runs on the same data dramatically (for example from 3000 to 1400000). The problem is that with such variability WCSS value is not decreasing as the number of clusters increases.

Parameters of mbkmeans:

num_init = 100, calc_wcss = TRUE
Here was used dataset with 206 cells, but same results I saw for other datasets too.

What I have tried:

increased the number of num_init: such unexpected results were observed less frequently, but still existed
set batch_size = 0.8*dataset_size: as 206<500 the whole dataset was used, so I tried to decrease the batch size a little and it have some good impact, but still there are unexpected high values

Code for the next plots:

pca_df - data frame with PCs for every cell

wcss_score <- function(k, df){
  km <- mbkmeans(df, clusters = k, num_init = 100, calc_wcss = TRUE)
  wcss <- sum(km[["WCSS_per_cluster"]])
  wcss
}
k <- c(1:20)
total_wcss <- sapply(k, wcss_score, pca_df)
plot(k, type='b', total_wcss, xlab='Number of clusters', ylab='Total WCSS', frame=FALSE)

Examples (wcss plots)

Here is how the plot with wcss values looks like on average (quite good):

Here are examples of strange behavior:

for the most of the strange runs it looks like this
but also wcss plot can look like this:

For testing the same clustering was performed with stats::kmeans (also with 100 initializations) and there is no problem.

As 206<500 and the whole dataset was then used for clustering, I thought that results of mbkmeans and kmeans should be similar somehow.

Examples (repeated clustering)

Repeated custering with k <- c(1:10, 20) :

mbkmeans

> bad <- 0
> for (i in 1:50) {
+   k <- c(1:10, 20)
+   total_wcss <- sapply(k, wcss_score, pca_df)
+   if (max(total_wcss) == total_wcss[length(total_wcss)]) {
+     bad <- bad + 1
+   }
+ }
> bad
[1] 7

For example, in 7 out of 50 runs last WCSS score (for k=20) was the greatest value of all.

stats::kmeans:

> bad <- 0
> for (i in 1:100) {
+   k <- c(1:10, 20)
+   total_wcss <- sapply(k, function(kk) {
+     kmeans(t(pca_df), centers = kk, nstart = 100)$tot.withinss
+   })
+   if (max(total_wcss) == total_wcss[length(total_wcss)]) {
+     bad <- bad + 1
+   }
+ }
> bad
[1] 0

Can you please suggest what could be the reason of such unexpected high values for WCSS scores and is it possible to fix somehow?

The text was updated successfully, but these errors were encountered:

drisso · 2022-02-07T17:01:09Z

Hi @JaneChik

thanks for reporting this. We never saw this behavior and it looks like a bug... although I can't think of what it could be the cause of it.

I have tried to reproduce the bug, but with the following reproducible example I coudln't:

library(mbkmeans)

p <- 10
n <- 200

y <- matrix(rnorm(n*p), nrow=p, ncol=n)

wcss_score <- function(k, df){
  km <- mbkmeans(df, clusters = k, num_init = 100, calc_wcss = TRUE)
  wcss <- sum(km[["WCSS_per_cluster"]])
  wcss
}

k <- c(1:20)
total_wcss <- sapply(k, wcss_score, y)
plot(k, type='b', total_wcss, xlab='Number of clusters', ylab='Total WCSS', frame=FALSE)

bad <- 0
for (i in 1:50) {
   k <- c(1:10, 20)
   total_wcss <- sapply(k, wcss_score, y)
   if (which.max(total_wcss) != 1) {
     bad <- bad + 1
   }
}
bad

Do you mind sharing with me the df object that you are using to generate the plots above to see if I can reproduce the issue? Alternatively (if you don't want to or can't share the data), does the same behavior happen for some randomly generated data?

I'm sure this is not the issue, but mbkmeans does not allow data frames as input and wants a matrix that has genes (or in your case PCs) as rows and cells as columns.

Finally, can you please confirm which version of R, Bioconductor, and mbkmeans you are using?

Best,
Davide

JaneChik · 2022-02-07T18:34:58Z

Thank you, Davide!

1. Dataset

Here is the example dataset I was using (pca_df.rds).
pca_df.zip

Also sorry, I wrote not correctly, it is a matrix, with data.frame clustering indeed doesn't work at all:

> class(pca_df)
[1] "matrix" "array"

I think, that it is probably something with the data, but I can't figure out what it could be.

2. Random data

Just tried randomly generated data. I guess, it is rather strange example, but with such duplicated dataset the same plots are observed:
y <- matrix( rep(rnorm(200), 10), nrow = 10, ncol = 200 )

Although, I haven't observed any duplicates in my data.

3. Versions used

Probably it can be the problem

R=4.0.5

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
...

Bioconductor=3.12

> packageVersion("BiocVersion")[,1:2]
[1] ‘3.12’

mbkmeans=1.6.1

> packageVersion("mbkmeans")
[1] ‘1.6.1’

I can't use another R version there.

So, I also tried on another machine with R=4.1.2, Bioconductor=3.13.1 and mbkmeans=1.8.0 and got the same results. But I don't know if it is the case for the latest version.

I guess, I can try development version installing from github...

drisso · 2022-02-08T08:40:01Z

Hi @JaneChik

Thanks for sending the data and for further exploring this with randomly generated data. I was able to reproduce your results.

It looks indeed related to repeated data points (at least for the randomly generated data): when you have a cluster made of exactly the same observation repeated multiple times, the WCSS_per_cluster is sometimes (correctly) 0 and sometimes a very large number. My guess is that this is due to numeric errors in R, but I will have to further explore this.

In the real data, it looks like the problems are when there are clusters of only one observation. I will check the code and report back.

As a side note, the data in pca_df don't look very good. Typically, in single-cell RNA-seq you want to perform PCA on log-normalized counts (and perhaps use scale=TRUE in the PCA function), are you doing that here?

Best,
Davide

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WCSS scores #17

WCSS scores #17

JaneChik commented Feb 4, 2022 •

edited

Loading

drisso commented Feb 7, 2022

JaneChik commented Feb 7, 2022

drisso commented Feb 8, 2022

WCSS scores #17

WCSS scores #17

Comments

JaneChik commented Feb 4, 2022 • edited Loading

Problem

Parameters of mbkmeans:

What I have tried:

Code for the next plots:

Examples (wcss plots)

Examples (repeated clustering)

drisso commented Feb 7, 2022

JaneChik commented Feb 7, 2022

1. Dataset

2. Random data

3. Versions used

drisso commented Feb 8, 2022

JaneChik commented Feb 4, 2022 •

edited

Loading