Three changes (one in _glm.py and two in _glm_cv.py) to enable more detailed analysis of CV performance #935

alanchalk · 2025-05-22T13:32:42Z

At the moment:

The deviances in deviance_path_ are too low. This is because w_test (the test weights) are not rescaled to 1.
The predict method of GeneralizedLinearRegressorCV does not work. This seems to be because it inherits from the glm, and in the linear_predictor it uses X @ self.coef_path_[alpha_index]. This is not correct when coef_path_ comes from CV since the first dimension is then the number of folds.
The CV method provides the deviance_path_ for (average) validation performance but not for train performance. Knowing how train and validation performance compare as penalization is reduced, is useful in practice.

To address some of the above, I have made, compiled and tested three changes

Expose the _convert_from_pandas method of _glm.py as convert_from_pandas
Correctly scaled w_test in _glm_cv.py
Added an attribute (self.train_indices_) which is a list of idx_train for each fold

These 3 changes allow the user to create any predictions needed and to create both train and validation curves on the data used for CV. I have a notebook which does this using a version of glum which I have built.

I have not tried to fix the predict method for CV.

A gist which will run on the new build, and which demonstrates use of the changes is at:

https://gist.github.com/alanchalk/cbb68ff9741ec89504d6f21b4b1ff344

(The gist mentions that if you are on a mac you can pip install the revised build from test pypi. If you need windows or linux I can try to add.

…indices in CV\n\n- Make convert_from_pandas public in _glm.py to allow external predictions\n- Scale test weights to 1 in _glm_cv.py to ensure correct test deviance\n- Store train indices from each fold as self.train_indices_ in _glm_cv.py

alanchalk · 2025-05-22T14:41:34Z

Hi, I notice you have been doing some testing. The tests seem to have been written with knowledge of the original bug, for example one of the tests is:

np.testing.assert_allclose(
    np.moveaxis(np.squeeze(glm.deviance_path_), 0, -1), elastic_net.mse_path_ / 5
)

where on the assumption of 5-fold CV, it was known that the result would be 5 times too low. Not sure if this helps?

alanchalk requested review from MarcAntoineSchmidtQC, jtilly and stanmart as code owners May 22, 2025 13:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Three changes (one in _glm.py and two in _glm_cv.py) to enable more detailed analysis of CV performance #935

Three changes (one in _glm.py and two in _glm_cv.py) to enable more detailed analysis of CV performance #935

Uh oh!

alanchalk commented May 22, 2025

Uh oh!

alanchalk commented May 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Three changes (one in _glm.py and two in _glm_cv.py) to enable more detailed analysis of CV performance #935

Are you sure you want to change the base?

Three changes (one in _glm.py and two in _glm_cv.py) to enable more detailed analysis of CV performance #935

Uh oh!

Conversation

alanchalk commented May 22, 2025

Uh oh!

alanchalk commented May 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant