-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GP formulae #290
GP formulae #290
Conversation
🦙 MegaLinter status: ✅ SUCCESS
See detailed report in MegaLinter reports |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #290 +/- ##
==========================================
+ Coverage 95.55% 97.07% +1.51%
==========================================
Files 23 30 +7
Lines 1666 1809 +143
==========================================
+ Hits 1592 1756 +164
+ Misses 74 53 -21
Continue to review full report in Codecov by Sentry.
|
…they're not statistically meaningful
…rk into gp-formulae
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmafoster1 Thanks for this Michael. Could you please add a few more bullet points on what changes this PR specifically includes? I can see there has been a lot added/refactored but I can't quite make sense of why / the benefits of them from your description above.
@f-allian sorry for the delay on this. I've been off ill. I've updated the initial comment at the top describing the changes. |
Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that | ||
must hold if the resulting causal inference is to be considered valid. | ||
""" | ||
self.modelling_assumptions.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@f-allian , I am confused by this, since we instantiate several LinearRegressionEstimator
s during testing, which is a subclass of this this that calls super().__init__
(i.e. the method defined above), which itself calls its super().__init__
method (from estimator.py
), which calls self.add_modelling_assumptions()
, so this line should be covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmafoster1 - I've carried out some tests this morning. We're getting this warning because this method is a duplicate of the same one in linear_regression_estimator.py
. Removing it fixes the warning. It makes the most sense to have this method in regression_estimator.py
and not linear_regression_estimator.py
. Leave this with me and I'll push later today.
I've also got a few more comments I'll add very soon.
x = dmatrix(self.formula.split("~")[1], x, return_type="dataframe") | ||
for col in x: | ||
if str(x.dtypes[col]) == "object": | ||
x = pd.get_dummies(x, columns=[col], drop_first=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to get this properly tested at some point, but it wasn't covered before. It's only flagging it now because I've moved the code around a sufficient amount that it thinks it's new (it's not). However, there's broader issues with this code that need addressing (see #262)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmafoster1 I've fixed the code coverage warning (see below for details).
Everything else looks fine overall, but it would be worth giving the estimation scripts more meaning filenames. For instance by changing gp.py
to genetic_programming_estimator.py
or estimator.py
to abstract_estimator.py
and so on. This would help things like with navigation, having an appropriate naming convention, and a standardised file structure.
Edit: I'll leave the merging / releasing the new major version with you :-)
Add modelling assumptions to the estimator. This is a list of strings which list the modelling assumptions that | ||
must hold if the resulting causal inference is to be considered valid. | ||
""" | ||
self.modelling_assumptions.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmafoster1 - I've carried out some tests this morning. We're getting this warning because this method is a duplicate of the same one in linear_regression_estimator.py
. Removing it fixes the warning. It makes the most sense to have this method in regression_estimator.py
and not linear_regression_estimator.py
. Leave this with me and I'll push later today.
I've also got a few more comments I'll add very soon.
The primary purpose of this PR is to add Genetic Programming to
LinearRegressionEstimator
class to infer more complex regression equations from the data, given the features identified from the DAG. For example, identification might give us an equationY ~ aX + bZ
. We could then use GP and the data to infer the relationshipY ~ aX^2 + b log(Z)
. The features are not different, but the estimation will be much more accurate (and the equational form could be validated by the user).In addition to this main functionality, I have taken the opportunity to refactor the estimation into a separate package, instead of all being in a single
estimators.py
file withintesting
. This is why there are several "new" files. I also had to fix a bunch of other files with the oldestimators.py
as a dependency.TLDR; the main contributions are
estimators.py
to have each estimator be in its own separate file within an `estimation' package