GP formulae #290

jmafoster1 · 2024-08-07T09:50:35Z

The primary purpose of this PR is to add Genetic Programming (GP) to the LinearRegressionEstimator class so that it can infer more complex regression equations from the data, given the features identified from the DAG. For example, identification might give us the equation Y ~ aX + bZ. We could then use GP and the data to infer the relationship Y ~ aX^2 + b log(Z). The features are the same, but the estimate should be much more accurate when the inferred form is closer to the true relationship (and the equational form can be validated by the user).
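To illustrate why the equation form matters even when the features are unchanged, here is a minimal sketch. It does not run GP and does not use the framework's API; it only fits the two candidate forms from the example by ordinary least squares on synthetic, noise-free data. The helper `fit_two_terms`, the data, and the coefficients 2 and 3 are assumptions made for illustration.

```python
import math


def fit_two_terms(f1, f2, y):
    """Least-squares fit of y ~ a*f1 + b*f2 (no intercept) via the normal equations."""
    s11 = sum(u * u for u in f1)
    s12 = sum(u * v for u, v in zip(f1, f2))
    s22 = sum(v * v for v in f2)
    s1y = sum(u * t for u, t in zip(f1, y))
    s2y = sum(v * t for v, t in zip(f2, y))
    det = s11 * s22 - s12 * s12
    # Cramer's rule for the 2x2 system
    a = (s1y * s22 - s12 * s2y) / det
    b = (s11 * s2y - s12 * s1y) / det
    sse = sum((t - a * u - b * v) ** 2 for u, v, t in zip(f1, f2, y))
    return a, b, sse


# Synthetic data (illustrative only): Y = 2*X^2 + 3*log(Z)
xs = [float(i) for i in range(1, 21)]
zs = [float((7 * i) % 11 + 1) for i in range(1, 21)]
ys = [2 * x**2 + 3 * math.log(z) for x, z in zip(xs, zs)]

# Same features X and Z, two candidate equation forms
_, _, sse_linear = fit_two_terms(xs, zs, ys)  # Y ~ aX + bZ
a, b, sse_gp_form = fit_two_terms(  # Y ~ aX^2 + b log(Z)
    [x**2 for x in xs], [math.log(z) for z in zs], ys
)

print(f"linear form SSE:      {sse_linear:.3f}")
print(f"transformed form SSE: {sse_gp_form:.3e} (a={a:.3f}, b={b:.3f})")
```

On this data the transformed form recovers the generating coefficients, while the linear form leaves a large residual. With real, noisy data the gap would depend on how close the inferred form is to the true relationship.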

In addition to this main functionality, I have taken the opportunity to refactor the estimation code into a separate package, instead of keeping everything in a single estimators.py file within testing. This is why there are several "new" files. I also had to fix a number of other files that depended on the old estimators.py.

TL;DR: the main contributions are

- adding GP functionality to infer LR equations from data, given a list of features from causal identification;
- refactoring estimators.py so that each estimator is in its own file within an `estimation` package;
- fixing the test/demo code to reflect this.

github-actions · 2024-08-07T09:51:35Z

🦙 MegaLinter status: ✅ SUCCESS

Descriptor	Linter	Files	Fixed	Errors	Elapsed time
✅ PYTHON	black	37		0	1.02s
✅ PYTHON	pylint	37		0	6.23s

See detailed report in MegaLinter reports

codecov · 2024-08-08T13:32:31Z

Codecov Report

Attention: Patch coverage is 99.15612% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.07%. Comparing base (4d11764) to head (ed832da).
Report is 27 commits behind head on main.

Files with missing lines	Patch %	Lines
...stimation/genetic_programming_regression_fitter.py	98.81%	2 Missing ⚠️
...esting/estimation/abstract_regression_estimator.py	97.95%	1 Missing ⚠️
...ausal_testing/estimation/cubic_spline_estimator.py	96.42%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
+ Coverage   95.55%   97.07%   +1.51%     
==========================================
  Files          23       30       +7     
  Lines        1666     1809     +143     
==========================================
+ Hits         1592     1756     +164     
+ Misses         74       53      -21

Files with missing lines	Coverage Δ
causal_testing/estimation/abstract_estimator.py	`100.00% <100.00%> (ø)`
...ting/estimation/instrumental_variable_estimator.py	`100.00% <100.00%> (ø)`
causal_testing/estimation/ipcw_estimator.py	`100.00% <100.00%> (ø)`
..._testing/estimation/linear_regression_estimator.py	`100.00% <100.00%> (ø)`
...esting/estimation/logistic_regression_estimator.py	`100.00% <100.00%> (ø)`
causal_testing/json_front/json_class.py	`98.00% <100.00%> (+0.02%)`	⬆️
causal_testing/specification/capabilities.py	`100.00% <ø> (ø)`
...sal_testing/surrogate/causal_surrogate_assisted.py	`100.00% <100.00%> (ø)`
...l_testing/surrogate/surrogate_search_algorithms.py	`98.46% <100.00%> (ø)`
causal_testing/testing/causal_test_adequacy.py	`87.17% <100.00%> (ø)`
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfd0534...ed832da. Read the comment docs.

f-allian

@jmafoster1 Thanks for this, Michael. Could you please add a few more bullet points on what changes this PR specifically includes? I can see that a lot has been added and refactored, but I can't quite work out the reasons for the changes, or their benefits, from your description above.

jmafoster1 · 2024-09-11T07:40:06Z

@f-allian sorry for the delay on this. I've been off ill. I've updated the initial comment at the top describing the changes.

jmafoster1 · 2024-09-11T08:12:51Z

causal_testing/estimation/regression_estimator.py

+        Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that
+        must hold if the resulting causal inference is to be considered valid.
+        """
+        self.modelling_assumptions.append(


@f-allian, I am confused by this. We instantiate several LinearRegressionEstimators during testing. LinearRegressionEstimator is a subclass of this class and calls super().__init__ (i.e. the method defined above), which itself calls its super().__init__ method (from estimator.py), which calls self.add_modelling_assumptions(), so shouldn't this line be covered?

jmafoster1 · 2024-09-11T08:24:09Z

causal_testing/estimation/regression_estimator.py

+        x = dmatrix(self.formula.split("~")[1], x, return_type="dataframe")
+        for col in x:
+            if str(x.dtypes[col]) == "object":
+                x = pd.get_dummies(x, columns=[col], drop_first=True)


It would be good to get this properly tested at some point, but it wasn't covered before. It's only being flagged now because I've moved the code around enough that it is treated as new (it's not). However, there are broader issues with this code that need addressing (see #262).
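As a starting point for such a test, here is a minimal pure-Python sketch of the behaviour the loop relies on: each string-valued column is replaced by indicator columns, with the first level dropped to serve as the baseline. This is a stand-in for illustration, not the pandas implementation; the helper `one_hot_drop_first` and the example rows are hypothetical.

```python
def one_hot_drop_first(rows, column):
    """Replace a string-valued column with k-1 indicator columns.

    The first level (in sorted order) is dropped and acts as the baseline.
    """
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


# Hypothetical data: one numeric column and one string-valued column
rows = [
    {"dose": 1.0, "group": "control"},
    {"dose": 2.0, "group": "treated"},
    {"dose": 3.0, "group": "placebo"},
]
encoded = one_hot_drop_first(rows, "group")
for row in encoded:
    print(row)
```

A real test would call the estimator with a DataFrame containing an object column and check that the resulting design matrix has the expected indicator columns.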

f-allian

@jmafoster1 I've fixed the code coverage warning (see below for details).

Everything else looks fine overall, but it would be worth giving the estimation scripts more meaningful filenames, for instance by changing gp.py to genetic_programming_estimator.py, estimator.py to abstract_estimator.py, and so on. This would help with navigation and give us a consistent naming convention and a standardised file structure.

Edit: I'll leave the merging / releasing the new major version with you :-)

f-allian · 2024-09-13T11:15:20Z

causal_testing/estimation/regression_estimator.py

+        Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that
+        must hold if the resulting causal inference is to be considered valid.
+        """
+        self.modelling_assumptions.append(


@jmafoster1 - I've carried out some tests this morning. We're getting this warning because this method is a duplicate of the same one in linear_regression_estimator.py. Removing it fixes the warning. It makes the most sense to have this method in regression_estimator.py and not linear_regression_estimator.py. Leave this with me and I'll push later today.

I've also got a few more comments I'll add very soon.
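The duplicate-method explanation above can be sketched as follows. The class and method names come from this thread, but the constructors are simplified (the real ones take arguments) and the assumption strings are hypothetical. While a subclass defines its own copy of the method, dynamic dispatch runs that copy, so the parent's body is never executed and is reported as uncovered.

```python
calls = []  # records which class's method body actually ran


class Estimator:
    def __init__(self):
        self.modelling_assumptions = []
        # Dispatched on the runtime type of self, not on Estimator
        self.add_modelling_assumptions()

    def add_modelling_assumptions(self):
        """Subclasses append their modelling assumptions here."""


class RegressionEstimator(Estimator):
    def __init__(self):
        super().__init__()

    def add_modelling_assumptions(self):
        calls.append("RegressionEstimator")
        self.modelling_assumptions.append("hypothetical regression assumption")


class LinearRegressionEstimator(RegressionEstimator):
    def __init__(self):
        super().__init__()

    # Duplicate override: while this exists, the parent's body never runs
    def add_modelling_assumptions(self):
        calls.append("LinearRegressionEstimator")
        self.modelling_assumptions.append("hypothetical regression assumption")


LinearRegressionEstimator()

# Removing the duplicate makes the call fall through to RegressionEstimator
del LinearRegressionEstimator.add_modelling_assumptions
estimator = LinearRegressionEstimator()

print(calls)
```

The first instantiation only exercises the subclass copy; after the duplicate is removed, the same call chain reaches the method in RegressionEstimator.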

jmafoster1 added 3 commits August 6, 2024 15:13

GP in

790c12b

Cleaned estimators a bit

c6f9d31

Moved estimation to a separate package

a13d83b

jmafoster1 added 7 commits August 7, 2024 15:02

pylint

3e25256

pytest

62e6b3d

pylint

5d915ed

pylint

5b0661d

all tests pass

e7deb78

RegressionEstimator class to combine common elements of Linear and Logistic Regression Estimator classes.

05c5499

Pylint

03f1bd8

jmafoster1 marked this pull request as ready for review August 8, 2024 12:50

Seeding gp power

968fc89

jmafoster1 and others added 9 commits August 8, 2024 15:05

Removed ate and risk ratio from logistic regression estimator, since they're not statistically meaningful

c0fbbdd

removed unnecessary raise NotImplementedError

b157583

pylint

5886480

Merge branch 'main' of github.com:CITCOM-project/CausalTestingFramework into gp-formulae

7f2f8ae

pyproject

c66bb15

pyproject

232cde2

fixed pylintrc for deprecated exceptions

1d6b712

black

2a7ce30

pylint

673165c

jmafoster1 requested review from f-allian, rsomers1998 and christopher-wild September 4, 2024 14:00

f-allian reviewed Sep 5, 2024

jmafoster1 commented Sep 11, 2024

jmafoster1 and others added 4 commits September 11, 2024 09:27

increased coverage a little

6efe594

actions/upload-artifact updated to v3

47fc503

deleted estimators.py

9e2b4ba

fix: code coverage

97fb6e7

f-allian approved these changes Sep 13, 2024

file renaming as suggested by @f-allian

ed832da

jmafoster1 merged commit d5f0dad into main Sep 17, 2024
16 checks passed

jmafoster1 deleted the gp-formulae branch September 17, 2024 14:29
