
Ensuring integrity of DataFrames for observed counts and model forecasts #73

Open · wants to merge 8 commits into base: main
Conversation

@thanasibakis (Collaborator) commented Jan 8, 2025

Resolves #67.

Overview

Here we introduce two new classes, linmod.data.CountsFrame and linmod.models.ForecastFrame, to help ensure the integrity of our count and proportion DataFrames.

These are subclasses of pl.DataFrame. Basic usage is to either pass an existing polars DataFrame into the constructor or to use the read_parquet class method. In either case, validation of the input is performed automatically.

Method signatures throughout the codebase have updated type hints, and pytest tests have been added for the validation routines of these objects.

Validation details

CountsFrame ensures:

  • All required columns are present (see CountsFrame.REQUIRED_COLUMNS)
  • Null values are not present in any column
  • The count column is an integer type

ForecastFrame ensures:

  • All required columns are present (see ForecastFrame.REQUIRED_COLUMNS)
  • Lineage proportions for a given (sample, date, division) sum to (roughly) 1

@thanasibakis (Collaborator, Author)

Other ideas of validations:

  • Counts are counts (i.e. integers)
  • Proportions sum to (roughly) one

@thanasibakis thanasibakis marked this pull request as ready for review January 22, 2025 23:56
@thanasibakis thanasibakis requested review from afmagee42 and swo January 22, 2025 23:56
@thanasibakis thanasibakis changed the title Typing and validation of dataframes for observed counts and model forecasts Ensuring integrity of DataFrames for observed counts and model forecasts Jan 22, 2025
@afmagee42 (Collaborator) left a comment

This PR makes me a lot more comfortable with our data objects.


assert self.REQUIRED_COLUMNS.issubset(
    self.columns
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"

Suggested change
- ), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"
+ ), f"Missing required columns: {', '.join(req for req in self.REQUIRED_COLUMNS if req not in self.columns)}"

We could be more descriptive. Not sure it really helps much though

@@ -9,6 +9,49 @@
from plotnine import aes, geom_line, ggplot, theme_bw


class ForecastFrame(pl.DataFrame):

If we end up with a third one of these, we'll cross the code duplication line and want to refactor these to share a common parent class


assert self.REQUIRED_COLUMNS.issubset(
    self.columns
), f"Missing at least one required column ({', '.join(self.REQUIRED_COLUMNS)})"

Could do the same `for req in` rewrite here, if we choose to do it above.

).agg(pl.sum("phi"))

assert (
    (proportion_sums["phi"] - 1).abs() < 1e-3

This tolerance makes me sad. Does it have to be this big?

from linmod.utils import expand_grid


def _generate_fake_samples_and_data(

Is this meaningfully different from the version in test_eval.py? Can we move it into a conftest.py and use it in both places? https://stackoverflow.com/questions/34466027/what-is-conftest-py-for-in-pytest

@afmagee42 (Collaborator)

Also, just to check: have we checked that the pipeline still runs?

Successfully merging this pull request may close these issues.

Define, document, and assert/test "standard model output format"