
Ensuring integrity of DataFrames for observed counts and model forecasts #73

Open · wants to merge 8 commits into base: main
Conversation

@thanasibakis (Collaborator) commented Jan 8, 2025

Resolves #67.

Overview

Here we introduce two new classes, linmod.data.CountsFrame and linmod.models.ForecastFrame, to help ensure the integrity of our count and proportion DataFrames.

These are subclasses of pl.DataFrame. Basic usage is to either pass an existing polars DataFrame into the constructor or to use the read_parquet class method. In either case, validation of the input is performed automatically.

Method signatures throughout the codebase have updated type hints, and pytest tests have been added for the validation routines of these objects.

Validation details

CountsFrame ensures:

  • All required columns are present (see CountsFrame.REQUIRED_COLUMNS)
  • Null values are not present in any column
  • The count column is an integer type

ForecastFrame ensures:

  • All required columns are present (see ForecastFrame.REQUIRED_COLUMNS)
  • Lineage proportions for a given (sample, date, division) sum to (roughly) 1

@thanasibakis (Collaborator, Author)

Other ideas of validations:

  • Counts are counts (i.e. integers)
  • Proportions sum to (roughly) one

@thanasibakis thanasibakis marked this pull request as ready for review January 22, 2025 23:56
@thanasibakis thanasibakis requested review from afmagee42 and swo January 22, 2025 23:56
@thanasibakis thanasibakis changed the title Typing and validation of dataframes for observed counts and model forecasts Ensuring integrity of DataFrames for observed counts and model forecasts Jan 22, 2025
@afmagee42 (Collaborator) left a comment

This PR makes me a lot more comfortable with our data objects.


assert self.REQUIRED_COLUMNS.issubset(
    self.columns
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"

Suggested change
- ), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"
+ ), f"Missing required columns: {', '.join(req for req in self.REQUIRED_COLUMNS if req not in self.columns)}"

We could be more descriptive. Not sure it really helps much though

@@ -9,6 +9,49 @@
from plotnine import aes, geom_line, ggplot, theme_bw


class ForecastFrame(pl.DataFrame):

If we end up with a third one of these, we'll cross the code duplication line and want to refactor these to share a common parent class


assert self.REQUIRED_COLUMNS.issubset(
    self.columns
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"

Could do the same `for req in` rewrite here, if we choose to do it above.

).agg(pl.sum("phi"))

assert (
    (proportion_sums["phi"] - 1).abs() < 1e-3

This tolerance makes me sad. Does it have to be this big?

from linmod.utils import expand_grid


def _generate_fake_samples_and_data(

Is this meaningfully different from the version in test_eval.py? Can we move it into a conftest.py and use it in both places? https://stackoverflow.com/questions/34466027/what-is-conftest-py-for-in-pytest

@afmagee42 (Collaborator)

Also, just to check: have we checked that the pipeline still runs?

Successfully merging this pull request may close these issues.

Define, document, and assert/test "standard model output format"