Improving megastudy zero imputation #32

Merged
merged 8 commits from all-study-vocabs into main on Dec 21, 2023

Conversation

d-callan
Contributor

Some things I tried to do, either intentionally or because they came up along the way of doing other things:

  1. improve the test data so it better reflects real use cases. This meant only passing variables that we expect to see in real life, and passing all study vocabs all the time.
  2. improve our handling of variable collections somewhat, mostly because it was doing something slightly unexpected that kept confusing me during testing and QC.
  3. fix a bug where, when multiple study-specific-vocab variables are present, some combinations between those different vocabs were overlooked.
  4. make sure the function that does the imputation doesn't remove columns from the data table it was passed.
  5. make things a bit more readable, hopefully.

One thing that happened along the way: I am no longer testing variable collections. In hindsight the test wasn't very good anyway, so I'm not sure we're handling variable collections well here, but I figure we can add a better test and fix things as necessary if/when we get a real use case.
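For context on point 3, the fix amounts to expanding the data to the full cross product of all study-specific vocabularies before filling zeroes, rather than only crossing each vocab against combinations that happen to appear in the data. A minimal data.table sketch of the idea (the variable names and vocab values here are hypothetical, not the package's actual API):

```r
library(data.table)

# Hypothetical study vocabularies for two study-specific-vocab variables
speciesVocab <- c("aedes", "culex", "anopheles")
attractantVocab <- c("light", "co2")

# Observed data only covers some combinations
observed <- data.table(
  species = c("aedes", "culex"),
  attractant = c("light", "light"),
  specimen_count = c(5L, 2L)
)

# Build ALL combinations across both vocabs, not just those observed,
# then left-join the observed counts and fill the gaps with zeroes
allCombos <- CJ(species = speciesVocab, attractant = attractantVocab)
imputed <- merge(allCombos, observed, by = c("species", "attractant"), all.x = TRUE)
imputed[is.na(specimen_count), specimen_count := 0L]
```

The overlooked-combinations bug corresponds to crossing each vocab only against observed values of the other, which would miss rows like anopheles/co2 entirely instead of imputing them as zero.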

Member

@bobular bobular left a comment


Tests look comprehensive! I left a few comments where I wasn't 100% sure. Thanks!

@@ -293,9 +334,9 @@ test_that("imputeZeroes method is sane", {
expect_equal(nrow(imputedDT[imputedDT$sample.specimen_count == 0]), 0)
Member


Does imputedDT have a column sample.specimen_count if we didn't ask for it?
If the column doesn't exist, does line 334 return zero and pass the test anyway?

Oh, I see that we kind of half-asked for it! (empty plotReference)

Contributor Author


If the column doesn't exist, the test won't pass. In this case I know the column was there because it's in mComplete@data, and we are no longer removing columns based on their presence in the VariableMetadataList.

dataType = new("DataType", value = 'STRING'),
dataShape = new("DataShape", value = 'CATEGORICAL'),
weightingVariableSpec = VariableSpec(variableId='specimen_count',entityId='sample'),
hasStudyDependentVocabulary = TRUE)
Member


What's the test starting at line 383 doing? There's no comment and my spot-the-difference powers are weak!

Contributor Author


The one at 341 tests the case where we have a special vocab, a downstream entity, and NO specimen_count in the plot. The one at 383 tests the case where we have a special vocab, a downstream entity, and also specimen_count in the plot. We said in these cases we would NOT impute zeroes: in the first case (341) because the variable to impute zeroes on isn't present, and in the second case (383) because the values would be removed later by our requirement for complete cases in the map. If/when the map supports proper representations of missingness, we'll have to change this test and add support.
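To illustrate the complete-cases point above: any imputed zero row would carry NA for variables that were never collected in that study, so a downstream complete-cases filter would drop it again and the imputation would gain nothing. A small illustrative sketch (the column names here are made up for the example):

```r
# An imputed zero row typically has NA for a variable the study never collected
df <- data.frame(
  species = c("aedes", "culex"),
  specimen_count = c(5, 0),
  collection_date = c("2023-06-01", NA)  # missing for the imputed row
)

# A downstream complete-cases requirement removes the imputed row again
kept <- df[complete.cases(df), ]
```

Only the fully observed row survives the filter, which is why imputing zeroes in this scenario was skipped.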

@@ -368,7 +423,7 @@ test_that("imputeZeroes method is sane", {
expect_equal(nrow(imputedDT[imputedDT$sample.specimen_count == 0]), 0)

# multiple special vocabs in same plot, w one shared weighting var
Member


Isn't this the same as the test starting at line 155?
Both have species, count, sex and attractant.
Ah, not quite. In the earlier one, sex was "not plotted".

dataType = new("DataType", value = 'STRING'),
dataShape = new("DataShape", value = 'CATEGORICAL'),
weightingVariableSpec = VariableSpec(variableId='specimen_count',entityId='sample'),
hasStudyDependentVocabulary = TRUE)
Member


Is the error expected because sex is repeated, or because we asked for a mix of sample variables with and without hasStudyDependentVocabulary?

Presumably we would have used sample.eyecolor here if we had it.

Contributor Author


Not sure, tbh. I think maybe I had a stroke while writing this one, lol.

Contributor Author


OK, I removed the duplicate sample.sex reference in the VariableMetadataList. I now expect this to fail because sample.sex is present and claims to have a study-specific vocab, but no vocab was provided for it in the megastudy object.

@d-callan d-callan merged commit 0743223 into main Dec 21, 2023
5 checks passed
@d-callan d-callan deleted the all-study-vocabs branch December 21, 2023 14:33