Scorers might need to know about training and testing data #3
Thanks for the note!
LOO is really a bad cross-validation strategy [*]. I wonder whether we should shape our design around making it work, or just push even harder for people not to use it.
[*] I had an insight yesterday into a simple reason why: the precision of the score measured on the test set grows as sqrt(n_test) (equivalently, its standard error shrinks as 1/sqrt(n_test)), as for any unbiased statistic. sqrt climbs very fast at the beginning, so in that part of the regime you are better off depleting the train set to benefit from the steep rise.
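To spell the footnote's scaling out, here is a sketch; sigma denotes the per-sample standard deviation of the loss, an assumption of this note rather than something stated above:

```latex
% Standard error of the score measured on a held-out test set of size n_test,
% assuming i.i.d. per-sample losses with standard deviation \sigma:
\[
  \operatorname{SE}\bigl(\widehat{\mathrm{score}}\bigr)
    \;\approx\; \frac{\sigma}{\sqrt{n_{\mathrm{test}}}}
  \qquad\Longrightarrow\qquad
  \text{precision} \;\propto\; \sqrt{n_{\mathrm{test}}}.
\]
```

Since sqrt(n) rises steepest at small n, the first few samples moved from train to test buy the most precision, which is the regime the footnote is pointing at.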
Is LOO more acceptable when used like some_score(cross_val_predict(X, y, cv=LOO()), y)?
No. I believe that's actually wrong: you are no longer computing the expectation of the error of the predictive model.
One way of convincing you that you are not computing the same thing is to
think of the correlation score: it's quite clear that it can be very
different between the two approaches.
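A minimal sketch of that divergence (not from the thread; the dataset, model, and ROC AUC as the correlation-style score are arbitrary illustrative choices):

```python
# Contrast the two quantities for a non-sample-wise score:
#  (1) the mean of per-fold ROC AUC scores,
#  (2) the ROC AUC of the pooled out-of-fold predictions from cross_val_predict.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Quantity (1): expectation over folds of the per-fold score.
per_fold = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print("mean of per-fold AUC:", per_fold.mean())

# Quantity (2): one score over the pooled out-of-fold predictions.
pooled = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
print("AUC of pooled predictions:", roc_auc_score(y, pooled))
```

The two numbers generally differ, because the pooled version compares predictions produced by five different fitted models against each other.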
To convince you that it's the "wrong" thing, I think the right thought to have in mind is that the cross-val score is the expectation, over the test data, of the prediction error of the model (formula 1 in http://arxiv.org/pdf/1606.05201.pdf). It's actually a double expectation: if l_M is the expected error of the model, the score is E[l_M], where the outer expectation is taken over the data used to train the model.
http://projecteuclid.org/download/pdfview_1/euclid.ssu/1268143839 has a good analysis of this, including the classic split of l_M into approximation error and estimation error.
Using score(cross_val_predict) is not computing that. It's computing the expectation of l_M jointly over the train and test data. Given that the two are not independent, it is not the same thing as the successive expectation.
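A sketch of that contrast in symbols, reusing the thread's l_M and assuming a loss ℓ, a scoring functional S, and out-of-fold predictions ŷ_i (these extra symbols are this note's additions, not the papers' exact notation):

```latex
\begin{align*}
  l_M &= \mathbb{E}_{(x,y)}\bigl[\ell\bigl(M_{D_{\mathrm{train}}}(x),\, y\bigr)\bigr]
      && \text{error of the model fit on } D_{\mathrm{train}} \\
  \text{CV score} &\approx \mathbb{E}_{D_{\mathrm{train}}}\bigl[\, l_M \,\bigr]
      && \text{successive (double) expectation} \\
  \mathrm{score}\bigl(\texttt{cross\_val\_predict}\bigr)
      &= S\bigl(\{(\hat{y}_i,\, y_i)\}_{i=1}^{n}\bigr)
      && \text{one functional of the pooled pairs}
\end{align*}
```

When S does not decompose into per-sample terms (correlation, ROC AUC, R^2 through the test-set mean), and because the pooled pairs mix predictions from models fit on different, overlapping training sets, the last line is not the double expectation above.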
Actually, now that I realize it, "cross_val_predict" is probably used
massively to compute things that shouldn't be computed.
Thanks for the response. Yes, the case of correlation (or ROC, or anything where output over samples is compared) is convincing, but it is not immediately obvious that this issue extends to sample-wise measures. I'm a bit weak on this theory, but I think I get the picture. I hope I find time to read Arlot and Celisse to solidify it. And while the proposed intention of …
So the thing is that R^2, our default regression metric, is not a sample-wise measure.
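For reference, the usual held-out-set definition (a sketch of the standard formula, not a quote from the thread):

```latex
\[
  R^2 \;=\; 1 \;-\;
  \frac{\sum_{i \in \mathrm{test}} \bigl(y_i - \hat{y}_i\bigr)^2}
       {\sum_{i \in \mathrm{test}} \bigl(y_i - \bar{y}_{\mathrm{test}}\bigr)^2},
  \qquad
  \bar{y}_{\mathrm{test}} \;=\; \frac{1}{n_{\mathrm{test}}} \sum_{i \in \mathrm{test}} y_i .
\]
```

Every test sample enters the denominator through \bar{y}_{\mathrm{test}}, so the score does not decompose sample-wise, and with a single-sample test set (as in LOO) the denominator is zero.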
On Aug 23, 2016 02:55, Joel Nothman wrote:
This is not a PR because I haven't written it yet. It's more a very loose RFC.
I think scorers might need to be able to distinguish between training and test data.
I think there were more cases, but here are two obvious ones (a sketch illustrating both follows at the end of this comment):
R^2 is currently computed using the test-set mean. That seems really odd, and breaks for LOO.
When doing cross-validation, the set of classes present in each fold can change, which can affect things like macro-F1 in weird ways, and can also lead to errors with LOO (scikit-learn/scikit-learn#4546).
I'm not sure if this is a good enough case yet, but I wanted somewhere to take a note ;)
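A minimal sketch of both points (hypothetical examples, not from the issue; the data, model, and label values are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Point 1: with LOO every test fold holds a single sample, so the "test-set
# mean" used by R^2 is that sample itself and the total sum of squares in the
# denominator is zero; each per-fold score is degenerate.
X, y = make_regression(n_samples=20, n_features=3, noise=1.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(), scoring="r2")
print(scores)  # degenerate values (NaN or 0, depending on the scikit-learn version)

# Point 2: macro-F1 averages per-class F1 over the classes it sees, so a fold
# that is missing a class is scored over a different label set unless the full
# set of labels is passed explicitly.
y_true = np.array([0, 0, 1, 1])  # a fold in which class 2 never appears
y_pred = np.array([0, 1, 1, 1])
print(f1_score(y_true, y_pred, average="macro"))                    # averaged over {0, 1}
print(f1_score(y_true, y_pred, average="macro", labels=[0, 1, 2]))  # averaged over {0, 1, 2}
```

Both cases are ones where the scorer would behave differently if it knew about the full training data: the training-set mean for R^2, the full label set for macro-F1.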