Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GP formulae #290

Merged
merged 25 commits into from
Sep 17, 2024
Merged

GP formulae #290

merged 25 commits into from
Sep 17, 2024

Conversation

jmafoster1
Copy link
Contributor

@jmafoster1 jmafoster1 commented Aug 7, 2024

The primary purpose of this PR is to add Genetic Programming to LinearRegressionEstimator class to infer more complex regression equations from the data, given the features identified from the DAG. For example, identification might give us an equation Y ~ aX + bZ. We could then use GP and the data to infer the relationship Y ~ aX^2 + b log(Z). The features are not different, but the estimation will be much more accurate (and the equational form could be validated by the user).

In addition to this main functionality, I have taken the opportunity to refactor the estimation into a separate package, instead of all being in a single estimators.py file within testing. This is why there are several "new" files. I also had to fix a bunch of other files with the old estimators.py as a dependency.

TLDR; the main contributions are

  • adding GP functionality to infer LR equations from data, given a list of features from causal identification.
  • refactoring estimators.py to have each estimator be in its own separate file within an `estimation' package
  • fixing test/demo code to reflect this

Copy link

github-actions bot commented Aug 7, 2024

🦙 MegaLinter status: ✅ SUCCESS

Descriptor Linter Files Fixed Errors Elapsed time
✅ PYTHON black 37 0 1.02s
✅ PYTHON pylint 37 0 6.23s

See detailed report in MegaLinter reports

MegaLinter is graciously provided by OX Security

@jmafoster1 jmafoster1 marked this pull request as ready for review August 8, 2024 12:50
Copy link

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 99.15612% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.07%. Comparing base (4d11764) to head (ed832da).
Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
...stimation/genetic_programming_regression_fitter.py 98.81% 2 Missing ⚠️
...esting/estimation/abstract_regression_estimator.py 97.95% 1 Missing ⚠️
...ausal_testing/estimation/cubic_spline_estimator.py 96.42% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
+ Coverage   95.55%   97.07%   +1.51%     
==========================================
  Files          23       30       +7     
  Lines        1666     1809     +143     
==========================================
+ Hits         1592     1756     +164     
+ Misses         74       53      -21     
Files with missing lines Coverage Δ
causal_testing/estimation/abstract_estimator.py 100.00% <100.00%> (ø)
...ting/estimation/instrumental_variable_estimator.py 100.00% <100.00%> (ø)
causal_testing/estimation/ipcw_estimator.py 100.00% <100.00%> (ø)
..._testing/estimation/linear_regression_estimator.py 100.00% <100.00%> (ø)
...esting/estimation/logistic_regression_estimator.py 100.00% <100.00%> (ø)
causal_testing/json_front/json_class.py 98.00% <100.00%> (+0.02%) ⬆️
causal_testing/specification/capabilities.py 100.00% <ø> (ø)
...sal_testing/surrogate/causal_surrogate_assisted.py 100.00% <100.00%> (ø)
...l_testing/surrogate/surrogate_search_algorithms.py 98.46% <100.00%> (ø)
causal_testing/testing/causal_test_adequacy.py 87.17% <100.00%> (ø)
... and 6 more

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfd0534...ed832da. Read the comment docs.

Copy link
Contributor

@f-allian f-allian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmafoster1 Thanks for this Michael. Could you please add a few more bullet points on what changes this PR specifically includes? I can see there has been a lot added/refactored but I can't quite make sense of why / the benefits of them from your description above.

@jmafoster1
Copy link
Contributor Author

@f-allian sorry for the delay on this. I've been off ill. I've updated the initial comment at the top describing the changes.

Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that
must hold if the resulting causal inference is to be considered valid.
"""
self.modelling_assumptions.append(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@f-allian , I am confused by this, since we instantiate several LinearRegressionEstimators during testing, which is a subclass of this this that calls super().__init__ (i.e. the method defined above), which itself calls its super().__init__ method (from estimator.py), which calls self.add_modelling_assumptions(), so this line should be covered?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmafoster1 - I've carried out some tests this morning. We're getting this warning because this method is a duplicate of the same one in linear_regression_estimator.py. Removing it fixes the warning. It makes the most sense to have this method in regression_estimator.py and not linear_regression_estimator.py. Leave this with me and I'll push later today.

I've also got a few more comments I'll add very soon.

x = dmatrix(self.formula.split("~")[1], x, return_type="dataframe")
for col in x:
if str(x.dtypes[col]) == "object":
x = pd.get_dummies(x, columns=[col], drop_first=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to get this properly tested at some point, but it wasn't covered before. It's only flagging it now because I've moved the code around a sufficient amount that it thinks it's new (it's not). However, there's broader issues with this code that need addressing (see #262)

Copy link
Contributor

@f-allian f-allian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmafoster1 I've fixed the code coverage warning (see below for details).

Everything else looks fine overall, but it would be worth giving the estimation scripts more meaning filenames. For instance by changing gp.py to genetic_programming_estimator.py or estimator.py to abstract_estimator.py and so on. This would help things like with navigation, having an appropriate naming convention, and a standardised file structure.

Edit: I'll leave the merging / releasing the new major version with you :-)

Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that
must hold if the resulting causal inference is to be considered valid.
"""
self.modelling_assumptions.append(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmafoster1 - I've carried out some tests this morning. We're getting this warning because this method is a duplicate of the same one in linear_regression_estimator.py. Removing it fixes the warning. It makes the most sense to have this method in regression_estimator.py and not linear_regression_estimator.py. Leave this with me and I'll push later today.

I've also got a few more comments I'll add very soon.

@jmafoster1 jmafoster1 merged commit d5f0dad into main Sep 17, 2024
16 checks passed
@jmafoster1 jmafoster1 deleted the gp-formulae branch September 17, 2024 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants