One idea is to do a Shapley value analysis with some faster model like XGBoost, or maybe use XGBoost to find a minimal set of features such that the loss still matches the loss given all the features, and then run SR on the features selected from that? You can in principle use PySR on this directly; it's just harder, since it's a combinatorics problem after all. So maybe increase the search size (number of populations, size of the populations, maxsize, ncycles_per_iteration, etc.) and run for longer.
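Roughly what I have in mind, as a sketch (untested; `X`/`y` are synthetic stand-ins here, and the SHAP cutoff, feature count, and PySR search settings are all placeholders you'd want to tune):

```python
import numpy as np
import shap
import xgboost as xgb
from pysr import PySRRegressor

# Stand-in data: replace with your redundant g_i feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = X[:, 0] * np.exp(X[:, 1]) + 2.0 * X[:, 2]

# 1. Fit a fast surrogate model on the full redundant library.
booster = xgb.XGBRegressor(n_estimators=500, max_depth=6)
booster.fit(X, y)

# 2. Rank features by mean |SHAP value| and keep a small subset.
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)
keep = np.argsort(importance)[::-1][:8]  # top-k cutoff is arbitrary; tune it

# (Sanity check: refit XGBoost on X[:, keep] and confirm the held-out loss
# is close to the full-library loss before trusting the subset.)

# 3. Run symbolic regression on the reduced set with a larger search.
model = PySRRegressor(
    niterations=500,             # run longer than the default
    populations=40,              # more populations
    population_size=100,         # larger populations
    maxsize=40,                  # allow bigger expressions
    ncycles_per_iteration=1000,  # more mutations per iteration
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "log"],
)
model.fit(X[:, keep], y)
print(model)
```

The refit check is the point of step 2: you want to be sure the reduced library hasn't dropped anything the surrogate actually needed before spending a week of cluster time on SR.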
Hi all,
I'm using PySR on a problem where I expect the true (known) relationship to have a structured form like y = sum_i f_i(z) * g_i, where the g_i are computed feature columns and the f_i depend on a small set of derived variables z. The issue is that the feature library is highly redundant: many g_i columns are strongly correlated / nearly linearly dependent across samples due to built-in constraints, so there are lots of near-equivalent representations of y.
Empirically, PySR finds a clean expression quickly if I hand it a minimal feature set, but with a larger “unbiased” redundant library it tends to wander into complicated expressions or not converge.
Is this kind of problem well suited to PySR? If so, how could I improve things? If I chuck it on a cluster for a week, is that likely to yield good results?
I've tried various amounts of scaling/normalization, pruning via correlation/SVD, and using ExpressionSpec/templates to enforce linearity in a subset of features, but convergence still seems difficult.
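For concreteness, the correlation/SVD pruning I mean is roughly this (a simplified sketch with stand-in data; my real pipeline uses the actual g_i columns and a tuned threshold):

```python
import numpy as np

# Stand-in for the g_i feature matrix; column 1 is deliberately near-duplicate.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))
X[:, 1] = X[:, 0] + 1e-3 * rng.normal(size=1000)

# Greedy correlation pruning: drop any column that is highly correlated
# with a column we have already decided to keep.
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.95
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)

# SVD rank check: how many directions actually carry variance?
singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
effective_rank = int((singular_values > 1e-8 * singular_values[0]).sum())
print(f"kept {len(keep)}/{X.shape[1]} columns; effective rank ~ {effective_rank}")
```

The SVD check is just to confirm how much genuine rank the library has compared to the number of columns.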
Thanks!