-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensuring integrity of DataFrames for observed counts and model forecasts #73
base: main
Are you sure you want to change the base?
Conversation
Other ideas of validations:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes me a lot more comfortable with our data objects.
|
||
assert self.REQUIRED_COLUMNS.issubset( | ||
self.columns | ||
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})" | |
), f"Missing required columns: ({', '.join(req for req in self.REQUIRED_COLUMNS if not req in self.columns)})" |
We could be more descriptive. Not sure it really helps much though
@@ -9,6 +9,49 @@ | |||
from plotnine import aes, geom_line, ggplot, theme_bw | |||
|
|||
|
|||
class ForecastFrame(pl.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we end up with a third one of these, we'll cross the code duplication line and want to refactor these to share a common parent class
|
||
assert self.REQUIRED_COLUMNS.issubset( | ||
self.columns | ||
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do the same for req in
here if choosing to do it above
).agg(pl.sum("phi")) | ||
|
||
assert ( | ||
(proportion_sums["phi"] - 1).abs() < 1e-3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tolerance makes me sad. Does it have to be this big?
from linmod.utils import expand_grid | ||
|
||
|
||
def _generate_fake_samples_and_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meaningfully different from the version in test_eval.py
? Can we move it into a conftest.py
and use it in both places? https://stackoverflow.com/questions/34466027/what-is-conftest-py-for-in-pytest
Also, just to check: have we checked that the pipeline still runs? |
Resolves #67.
Overview
Here we introduce two new classes,
linmod.data.CountsFrame
andlinmod.models.ForecastFrame
, to help us ensure the integrity of our count and proportion DataFrame.These are subclasses of
pl.DataFrame
. Basic usage is to either pass an existing polars DataFrame into the constructor, or use theread_parquet
class method. In either case, validation of the input is automatically performed.Method signatures throughout the codebase have updated type hints, and pytests have been created for the validation routines of these objects.
Validation details
CountsFrame
ensures:CountsFrame.REQUIRED_COLUMNS
)count
column is an integer typeForecastFrame
ensures:ForecastFrame.REQUIRED_COLUMNS
)(sample, date, division)
sum to (roughly) 1