Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem processing data #2

Open
rchurt opened this issue Jul 19, 2020 · 17 comments
Open

Problem processing data #2

rchurt opened this issue Jul 19, 2020 · 17 comments

Comments

@rchurt
Copy link

rchurt commented Jul 19, 2020

I have some data that I’m trying to run through the estimate.R function, and I keep getting the error: Error in uniroot(fit.epid, lower = res.R$maximum, upper = log(range[2]), : f.lower = f(lower) is NA.

I’m running check.incid and using the output of that as the input to estimate.R. Specifically, I combine the output of check.incid into a named vector. I've saved one example as a csv and attached it. This is what I’m using as the input to estimate.R. You’ll notice that some numbers are fractional—this is because this vector is an average of case numbers across multiple counties.

I’m guessing that the vector isn’t being formatted quite right by check.incid. Do you have any suggestions?

Thanks in advance,
Rob
cases_averaged.csv.zip

@rchurt
Copy link
Author

rchurt commented Jul 19, 2020

Response from Thomas Obadia:

If memory serves right, uniroot() is called when profiling the likelihood with the ML method, and it seems that range if NA, for some reason (by default it should be 0.01–50). I’m thinking the non-integer counts are somewhat causing that issue.

The ML method assumes Poisson distribution of secondary cases and the likelihood is written as such. When running some debug, here’s what I run into (after toying around):

50: In dpois(epid$incid[2:T] - import[2:T], lambda = sim.epid,  ... :
  non-integer x = 157.352941

A few thoughts: there’s no need to pass the result to check.incid() to estimate.R(). Whenever an internal method is used (est.R0.ML(), .EG()…) check.incid is systematically called, except when specifically ignored (with the checked flag set to FALSE). Although I understand the idea of averaging cases over multiple counties, it’s not compatible with Poisson likelihood here (and any method expecting incidence count, incidentally). A first rough attempt would be to just round the vector and see what happens. Another concern in that specific dataset that you shared is the very long left-trail of zeros, which requires tweaking begin/end of the curve to pass to a parametric likelihood.

Here’s what I’m able to reach, with a very quick attempt and using under-the-hood method instead of the estimate.R “main” function :

mGT = generation.time("gamma", val = c(5, 10))
est.R0.ML(epid = round(tmp), GT = mGT, methods = "ML", checked = T, unknown.GT =T, begin="2020-03-14", end="2020-07-16")
Reproduction number estimate using  Maximum Likelihood  method.
R :  1.14747[ 1.129204 , 1.165909 ]

Note that it’s a very quick attempt and the generation time is taken completely at random. I did read a few early CDC reports that provided some estimates of the Generation Time, but they were really early in the epidemic and there’s probably better data now. Hence why I passed unknown.GT = TRUE, here, for uncertainty.

@tobadia
Copy link
Owner

tobadia commented Jul 21, 2020

Thanks for following-up here and tidying up my answer a bit, makes life easier!

The code just above was a quick attempt directly calling under-the-hood functions. Here there are two key issues, ultimately:

  1. Non-integer counts, that can't fit any Poisson likelihood
  2. Generation time distribution to pick properly (+ passing the unknown.GT = TRUE flag to try and estimate a generation time that fits the data. Too irrealistic GT distributions would end-up in the likelihood not being properly evaluated

As a side issue, the trail of zeros at the beginning of the incidence count vector also makes the function difficult to evaluate and should be discarded.

The "user-friendly" version of my code above can be simplified as:

# Load dataset as a named vector
incid <- read.table("/path/to/incidence/count/attached/to/initial/post.csv", header=T, sep = ",")
incid <- setNames(incid$x, row.names(incid))
# Use a not-too-bad generation time
mGT <- generation.time("gamma", val = c(5, 10))
# Estimate R and GT jointly from data, assuming Poisson distirbution and strating at first non-0 day
est.R <- estimate.R(epid = round(incid), GT = mGT, methods = c("ML"), checked = T, unknown.GT =T, begin = "2020-03-14", end = "2020-07-16")

Note that the optimized GT disitrbution can then be found in est.R$estimates$ML$GT and differs from the initial Gamma (mean = 5, sd = 10):

> est.R$GT
Discretized Generation Time distribution
mean: 8.752925 , sd: 11.63884

@rchurt
Copy link
Author

rchurt commented Jul 21, 2020

Happy to, and thanks so much for the help!

That code works in my hands too (see Google Colab notebook I just made a PR for). I'm wondering whether it would be worth adding some cleanup functions to the package to make data compatible with the R0 estimation functions. Also adding specific errors if the data simply isn't appropriate for R0 estimation.

Ideas include:

  • Round entries
incid <- round(incid)
  • Remove NAs
incid <- incid[!is.na(incid)]
  • Remove leading or trailing zeros
incid <- incid[min(which(incid != 0)) : max(which(incid != 0))]
  • Raise an error if incid doesn't have more than a certain number of total cases (10?) or days of data (2?), or if more than a certain percentage of the entries in the vector are 0 (30%?)
  • Raise an error if incid is constant the whole time
var(incid) == 0

If you'd like, I can share a larger COVID dataset that isn't as well-behaved as the column I sent, and we can try to add functions that address the problems. Up to you, but happy to help!

@tobadia
Copy link
Owner

tobadia commented Jul 24, 2020

check.incid() was implemented with that idea in mind, but is likely too restrictive. And also, having it called everytime an estimation routine is used prevent it from being too interactive.

I like the idea of an external cleanup/inspect function that would warn the user if some specific edge cases are encountered. That could be a function not called as part of any estimation routine, e.g. some inspect.data(). It could still call check.incid() inside to show what it'd return, but would also inspect what you've suggested and possibly more cases. Will work on that as it will benefit everyone.

@rchurt
Copy link
Author

rchurt commented Jul 24, 2020

That sounds great! Let me know if there's anything I can do to help :)

@rchurt
Copy link
Author

rchurt commented Jul 25, 2020

Ran into another issue with processing this dataset: I had initially (mistakenly) been feeding the accumulated case numbers into est.R0.TD as incid, but I realized that I should be using the number of new cases per day instead. So I made a new incid by subtracting each day's accumulated case number from the previous day's ("real_cases" column of attached), but when I ran it through the command below, I got mostly NAs for R(t). Have you seen this kind of behavior before?

mGT <- generation.time("gamma", c(4, 5))
est.R0.TD(incid, mGT, unknown.GT=T, nsim=10000)

Wyoming_R_t.csv.zip

@tobadia
Copy link
Owner

tobadia commented Jul 28, 2020

Amazing, you've been digging this much more than I have been over the past few days. Baby + normal work + maintaining code from years ago = challenging. But I'll get there!

I'm planning to fit all we've discussed here into a inspect.data()-themed function sometimes this week or maybe over the week-end. I'll try and keep an updated list of suggestions in thsi thread (and hopefully this post here).

Ideas:

  • inspect.data() would return a cleansed array, with format strictly identical to input that can hopefully be passed to estimate.R() and avoid user running into issues discussed here. Another option could be not to return anything and just suggest the user to clean the data according to warning. This is more conservative but I like the idea of user-led data cleaning rather than automated cleanup...
  • Non-integer data: yield a warning and round
  • NA's: double-check what happens when they are in-between data (drop? ignore? do they cause harm somewhere?). Trim leading/trailing + yield warning
  • 0's : remove leading/trailing + yield warning + indicate that the same could have been achieved by properly setting begin and and flags in estimate.R()
  • scarce data: yield an warning (error?) if length of incid is shorter than length of generation time
  • scarce data: yield an warning (error?) if total cumulative number of cases doesn't allow to cover at least 50% (?) of the generation time distribution starting at 1st non-0 index
  • incorrect data: try and detect cumulative incidence instead of time-series. Search for concavity e.g. all points between first-non-0 and end above linear interpolation from same index + yield warning and suggest diff() to extract incidence
  • incorrect data: investigate what happens with constant incidence

@rchurt
Copy link
Author

rchurt commented Jul 28, 2020

Yeah, I'm amazed you've been able to respond at all!

This sounds great—then I'll try to break inspect.data() with my weird data :)

@rchurt
Copy link
Author

rchurt commented Aug 27, 2020

Hi @tobadia, just wanted to check in to see how things are going with this. No rush, and thanks again for your help!

@tobadia
Copy link
Owner

tobadia commented Aug 27, 2020

Hey @rchurt,
I did start coding and have some stuffs nearly ready, but in-between, holidays kicked in and work remained on the side. I have less time than I'd have hoped on my spare time. I'll be back working next week and hopefully can commit stuffs then. Cheers!

@rchurt
Copy link
Author

rchurt commented Aug 27, 2020

No problem, and thanks for finding time given how busy you are!

@tobadia
Copy link
Owner

tobadia commented Sep 4, 2020

Hey @rchurt !
Just commited a devel-inspect branch that implements an attempt at inspect.data(). Feel free to toy around. I haven't included a check for concavity (cumulative incidence) as initially thought because it is way less obvious than what i had in mind. By any chance, could you share the data before the one you precessed into "real_cases" ? If I'm not mistaken, it translates into something that's near-to-linear... which could be the early stages of an exponential (when cases are very few).

I had in mind calculating the slope of the data with increasing number of points, but I don't think it'd help much there.

Please go ahead and install the devel version with the following (requires the devtools library):
devtools::install_github("tobadia/R0", ref = "devel-inspect")

In the end, inspect.data() doesn't return anything: it just yields warnings in which suggestions are made. I'll enhance the phrasing with your feedback and possibly that from other users. Once this thing is decently shaped, I'll merge it into the master branch and move on to more thorough updates. There's a lot of cleanup to do for displaying graphics, which will be the next big thing.

@rchurt
Copy link
Author

rchurt commented Sep 4, 2020

Awesome! Let me play around with this and get back to you. In the meantime, here are the raw and processed data you mentioned. Thanks!

Archive.zip

@rchurt
Copy link
Author

rchurt commented Sep 8, 2020

Hi @tobadia,

Sorry for the basic question, but when I try to do the above, I get the error: Error in inspect.data(): could not find function "inspect.data". It seems that the download is working, but the function still isn't found.

To reproduce my problem exactly, you can open a new R notebook in Colab and run:

devtools::install_github("tobadia/R0", ref = "devel-inspect")

inspect.data()

Have you run into this before?

@tobadia
Copy link
Owner

tobadia commented Sep 9, 2020

Indeed. Fixed, I had forgotten that methods should be exported in the NAMESPACE. Should be working with the most recent commit.

@rchurt
Copy link
Author

rchurt commented Sep 19, 2020

Hi @tobadia,
I'm sure I'm doing something silly, but I'm still getting the same error about not finding the function. If you try to run it in Colab as above, does it work for you?

@tobadia
Copy link
Owner

tobadia commented Sep 23, 2020

Hey @rchurt
There's definitely something missing: the library wasn't loaded in your code above.

The code below does work in the Google collab notebook (I had never used this before and it looks like a very nice way to distribute repliable examples!)

devtools::install_github("tobadia/R0", ref = "devel-inspect")
library(R0)
inspect.data()

That should do the trick. Now, onto breaking things :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants