Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preseqR.rSAC does not give the same result as either the ds or ztnb algorithms #25

Open
johan-gson opened this issue Mar 19, 2020 · 6 comments

Comments

@johan-gson
Copy link

As I understand it, this function should pick one of ds or ztnb, whichever fits best. The problem is that if it selects ztnb, it doesn't return the same as ztnb. There is an obvious bug in the code that gives this effect, see below. Perhaps this is meant to be that way?

The functions differ because of a suspected bad variable assignment:

preseqR.rSAC <- function(n, r=1, mt=20, size=SIZE.INIT, mu=MU.INIT)
{
para <- preseqR.ztnb.em(n) ##this call is also different, all params are not sent in
shape <- para$size ##### HERE IT STARTS, THIS IS NOT ASSIGNED TO size
mu <- para$mu

the population is heterogeneous

because the coefficient of variation is large $1 / sqrt(shape)$

if (shape <= 1) {
f.rSAC <- ds.rSAC(n=n, r=r, mt=mt)
} else {
## the population is close to be homogeneous
## the ZTNB approach is applied

## the probability of a species observed in the initial sample
p <- 1 - dnbinom(0, size = size, mu = mu) ########HERE size is used
## L is the estimated number of species in total
L <- sum(as.numeric(n[, 2])) / p
## ZTNB estimator
f.rSAC <- function(t) {
  L * pnbinom(r - 1, size=size, mu=mu*t, lower.tail=FALSE) ########HERE size is used
}

}
return(f.rSAC)
}

@andrewdavidsmith
Copy link
Collaborator

@johan-gson I follow your logic. Did you try making the change?

@johan-gson
Copy link
Author

I tested to create my own function by copying the original one and change it. Disappointingly, it performs worse, even worse than ztnb or ds on their own for my application, so it may be that selection on CV is not ideal in my case. It still looks like a bug compared to what is written in the paper, but I'm not sure what is best to do. I'm using preseq as part of an interesting method for single-cell RNA-Seq; I would like to discuss this with you over Skype if you have the time, I think the prediction could potentially be optimized for this purpose?

@andrewdavidsmith
Copy link
Collaborator

Please contact me directly rather than through GitHub.

@andrewdavidsmith
Copy link
Collaborator

@johan-gson I need a test case that will allow me to see if the size vs. shape are being used properly. Depending on the parameterization, this might easily be correct despite the apparent disagreement in names of variables. The "size" of the ZTNB seems to be the "shape" for the corresponding mixing gamma distribution. Everything points to this parameterization being assumed: the comment about the c_v, the parameterization of dnbinom in the em (at the top of the function), and the "shape" is only used for assessing the c_v. Bugs are easy when different people assume different parameterizations for these distributions, so I'll try to check it.

@johan-gson
Copy link
Author

I can put together a small example for you with some code and data, hopefully later today. I also sent you an email about my other questions.

@johan-gson
Copy link
Author

I sent you a test case on email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants