
WARNING: The least populated class in y has only 2 members, which is less than n_splits=3 #22

FrancescoCasalegno opened this issue Apr 19, 2022 · 0 comments

Context & Description

When we run k-fold cross-validation, we use n_splits=3.

But for layers L4 and L6 of the interneurons dataset, the classes L4_BP and L6_DBC have only 2 samples, which is fewer than n_splits=3.
(Figure: counts per Interneurons subclass)

This situation generates the following Python warning when iterating over StratifiedKFold.split(X, y):

UserWarning: The least populated class in y has only 2 members, which is less than n_splits=3.

What happens?

If a class has fewer members than n_splits, then for some splits of StratifiedKFold there will be no representative of that class in the validation set or in the training set. For instance, splitting [0] * 8 + [1] * 2 with StratifiedKFold(n_splits=3) yields the following training and validation sets, where in split 3 there is no sample of class 1 in the validation set!

    train-set           --- valid-set
[1, 0, 0, 0, 0, 0]      --- [1, 0, 0, 0]   # split 1
[1, 0, 0, 0, 0, 0, 0]   --- [1, 0, 0]      # split 2
[1, 1, 0, 0, 0, 0, 0]   --- [0, 0, 0]      # split 3
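
A minimal sketch to reproduce the warning and the splits above (the exact fold ordering may differ across scikit-learn versions):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy features: only the labels matter for stratification.
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=3)
# Iterating emits the UserWarning, since class 1 has 2 < 3 members.
for i, (train_idx, valid_idx) in enumerate(skf.split(X, y), start=1):
    print(f"split {i}: train={y[train_idx].tolist()} --- valid={y[valid_idx].tolist()}")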

Why may this be an issue?

  1. Some metrics may be impossible to compute, or may return misleading values.
    Metrics such as precision_score, recall_score, and f1_score cannot be computed if there is no sample for a given class.
    For instance, in the example above, using the validation set of split 3 to compute f1_score for y_pred = [0, 0, 0] will return 0.0 (despite y_pred perfectly matching y_true!) and raise this warning:
UndefinedMetricWarning: 
Precision is ill-defined and being set to 0.0 due to no predicted samples. 
Use `zero_division` parameter to control this behavior.
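
For reference, this behaviour can be reproduced in isolation (a tiny sketch using split 3's validation labels):

from sklearn.metrics import f1_score

y_true = [0, 0, 0]  # validation labels of split 3
y_pred = [0, 0, 0]  # predictions match y_true exactly

# With the default pos_label=1 there are no true or predicted samples of class 1,
# so precision/recall are undefined and the score is forced to 0.0.
print(f1_score(y_true, y_pred))  # 0.0, plus an UndefinedMetricWarning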

However, in our case we do not compute metrics per split and then average across all splits; instead, we take all the out-of-sample predictions (generated during the various splits) and compute the metric on all samples at once. Therefore, no class can ever have 0 samples during evaluation.
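
A sketch of this pooled evaluation scheme, using cross_val_predict as a stand-in (the dataset and estimator here are placeholders, not our actual pipeline):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)

# Each sample gets exactly one out-of-sample prediction across the 3 splits;
# the metric is then computed once on the pooled predictions.
y_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=StratifiedKFold(n_splits=3)
)
print(f1_score(y, y_pred, average="weighted"))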

  2. A class in the validation set may not be present in the training set.
    This would be dramatic, because after training the model would not even be aware of the existence of a class that is nevertheless present in the validation set, so it is guaranteed never to predict that class.

However, I have never observed this happening on our data, and I am not even sure it is possible.

  3. Evaluating on classes with very few samples may not be very meaningful.
    Does it really make sense to take into account the model's performance on a class that has only 1 member in the training set or in the validation set?

However, as long as we look at micro or weighted averages, the impact of a (potentially awful) performance on classes with 1 or 2 samples is limited. But this could become a problem if we want to look at macro averages. https://github.com/scikit-learn/scikit-learn/blob/baf828ca126bcb2c0ad813226963621cafe38adb/sklearn/metrics/_classification.py#L1049-L1062
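
A toy illustration of the averaging effect (the label vectors are made up; class 2 stands for a rare class predicted completely wrong):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 2]  # 8 samples of class 0, 1 sample of class 2
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0]  # class 0 perfect, the rare class 2 missed

print(f1_score(y_true, y_pred, average="micro"))     # ~0.89: rare class barely matters
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.84
print(f1_score(y_true, y_pred, average="macro"))     # ~0.47: rare class halves the score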

How do we solve this?

We could remove classes with fewer than 3 samples, or merge them into larger classes.
But this should be discussed with the scientists: maybe those classes with few samples are very important and well defined, and must be kept anyway?
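
If we go down that road, a possible sketch with pandas (the threshold, the example labels other than L4_BP / L6_DBC, and the "other" label are all illustrative choices to validate with the scientists):

import pandas as pd

MIN_SAMPLES = 3  # should be >= n_splits for StratifiedKFold

labels = pd.Series(["L4_BP", "L4_BP", "L6_DBC", "L6_DBC", "L4_MC", "L4_MC", "L4_MC"])
counts = labels.value_counts()
rare = counts[counts < MIN_SAMPLES].index

# Option A: drop the samples belonging to rare classes.
kept = labels[~labels.isin(rare)]

# Option B: merge rare classes into a single catch-all class.
merged = labels.where(~labels.isin(rare), other="other")

print(kept.value_counts())
print(merged.value_counts())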
