Varsel stat and stratified cross-validation for highly imbalanced data #328
As stated in #25, AUC has been implemented by #27, and the Brier score is effectively the MSE (see this comment). I'm currently not sure whether the Brier score (MSE) is actually available for the binomial family, i.e., whether an error is thrown when trying to select it. If you experience such an error, please report it with a reproducible example; it should then be easy to fix (in that case, I'll re-open this issue).
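To illustrate that equivalence (a sketch of mine, not projpred code): for a binary 0/1 outcome, the Brier score is the mean squared difference between the outcome and the predicted probability, i.e., the MSE on the probability scale. A prevalence-only reference forecast gives the baseline used for the Brier *skill* score mentioned later in this thread.

```r
# Minimal sketch (not projpred code): the Brier score for a binary outcome
# is the MSE of the predicted probabilities.
set.seed(1)
n <- 500L
p <- plogis(-2 + rnorm(n))            # predicted event probabilities
y <- rbinom(n, size = 1, prob = p)    # observed 0/1 outcomes (imbalanced)

brier <- mean((y - p)^2)              # Brier score == MSE here

# Reference score of a forecast that always predicts the observed prevalence;
# this is the baseline used for the Brier skill score.
brier_ref <- mean((y - mean(y))^2)
bss <- 1 - brier / brier_ref          # > 0 means better than the prevalence-only forecast

c(brier = brier, brier_ref = brier_ref, bss = bss)
```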
Yes, this is possible via argument … (see projpred/tests/testthat/test_varsel.R, lines 757 to 760 and lines 863 to 866, at commit 33047d6).
Thanks a lot for your answer, Frank. This helps a lot. And apologies, I should have been more explicit in my question or problem statement. I am interested in an evaluation metric that focuses on predicting the minority class well and is less sensitive to being swamped by good predictions of the majority class. Is the implemented AUC the AUROC (area under the receiver operating characteristic curve), which assumes that both classes (minority and majority) are important, or the AUPRC (area under the precision-recall curve), which focuses on classifying the minority class correctly? Or, asked differently, which of the currently implemented selection statistics would you recommend for highly imbalanced data when I am mostly interested in correctly classifying the minority class? Thanks a lot for your help.
As far as I understand the source code (lines 29 to 54 and lines 232 to 248 at commit 33047d6), the implemented AUC is the AUROC.
If it helps: projpred's `auc()` gives the same result as pROC's `auc()`:

```r
set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)
dat <- data.frame(y = y, mu = mu)
( auc_val_projpred <- projpred:::auc(cbind(y, mu, 1)) )
library(pROC)
auc_val <- auc(y ~ mu, data = dat, direction = "<", algorithm = 1)
( auc_val_pROC <- auc_val[seq_along(auc_val)] ) # Indexing only to drop attributes.
stopifnot(isTRUE(all.equal(auc_val_pROC, auc_val_projpred, tolerance = 1e-20)))
```
(On that occasion, I'm realizing that the comment in line 39 (at commit 33047d6) should probably read `# false positive weights` instead of `# true negative weights`, and that I need to check `projpred:::auc()` in case of a binomial family with > 1 trials.)
Thanks so much for looking further into this. Implementing additional evaluation metrics is likely not one of your current priorities in further developing projpred, but for cases with highly imbalanced data it would be wonderful to be able, in the future, to select metrics like the AUPRC or the F1 score, which both focus on predicting the minority class correctly and are recommended for highly imbalanced data (see, for instance, Saito & Rehmsmeier 2015, doi: 10.1371/journal.pone.0118432). I am not sure how often you and other users of projpred come across cases with highly imbalanced data, and how difficult it would be to implement additional metrics in projpred. Thank you for your help and understanding, and for developing this great project. Best, Andreas
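For readers who want to compute such metrics outside projpred in the meantime, here is a minimal base-R sketch of mine (not a projpred feature): the 0.5 classification threshold for the F1 score is an arbitrary choice, and the AUPRC is approximated by simple trapezoidal integration over a threshold grid.

```r
# Minimal sketch (not part of projpred): F1 score and an approximate AUPRC
# from 0/1 outcomes y and predicted probabilities p.
set.seed(2)
n <- 1000L
p <- plogis(-3 + 2 * rnorm(n))      # predicted probabilities (rare positives)
y <- rbinom(n, size = 1, prob = p)  # observed 0/1 outcomes

# F1 at an (arbitrary) 0.5 threshold.
pred <- as.integer(p >= 0.5)
tp <- sum(pred == 1 & y == 1)
fp <- sum(pred == 1 & y == 0)
fn <- sum(pred == 0 & y == 1)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

# Approximate AUPRC: precision-recall pairs over a grid of thresholds,
# integrated with the trapezoidal rule.
thresholds <- sort(unique(p), decreasing = TRUE)
pr <- t(sapply(thresholds, function(thr) {
  pred <- as.integer(p >= thr)
  tp <- sum(pred == 1 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  c(recall = tp / (tp + fn), precision = tp / max(tp + fp, 1))
}))
ord <- order(pr[, "recall"])
auprc <- sum(diff(pr[ord, "recall"]) *
             (head(pr[ord, "precision"], -1) + tail(pr[ord, "precision"], -1)) / 2)

c(f1 = f1, auprc = auprc)
```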
Yes, thank you for these helpful suggestions. I'm re-opening this issue as a feature request for these additional statistics.
Excellent. Thanks!
A short update on this feature request: I have tried to write a new function to calculate the true skill statistic, which is also known as the Peirce score or the Hanssen-Kuipers discriminant and copes well with highly imbalanced data. The basic formula is tss = tpr + tnr - 1, where tpr is the true positive rate and tnr the true negative rate. Here is a short code snippet that calculates the TSS for simulated data (sketched below).
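A minimal sketch of such a TSS calculation (the original snippet is not reproduced here; the fixed 0.5 classification threshold is an assumption of mine):

```r
# Sketch: true skill statistic (TSS = TPR + TNR - 1) for simulated binary
# data, using a fixed 0.5 threshold to turn probabilities into classes.
set.seed(6834)
nobs <- 1000L
mu <- binomial()$linkinv(-2 + rnorm(nobs))   # rare positives -> imbalanced data
y <- rbinom(nobs, size = 1, prob = mu)

pred <- as.integer(mu >= 0.5)                # hard classification at 0.5

tp <- sum(pred == 1 & y == 1)
tn <- sum(pred == 0 & y == 0)
fp <- sum(pred == 1 & y == 0)
fn <- sum(pred == 0 & y == 1)

tpr <- tp / (tp + fn)   # sensitivity / recall
tnr <- tn / (tn + fp)   # specificity
tss <- tpr + tnr - 1    # also known as Youden's J / Peirce score
tss
```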
I tried to implement this function following the logic of projpred's auc() function, but I have struggled to wrap my head around how the weights are implemented and, specifically, how the cumulative sums of weights are used to calculate the different error rates. The auc() function already calculates the tpr and fpr. If the tnr and fnr were added, a whole host of different scores could be calculated (https://en.wikipedia.org/wiki/Sensitivity_and_specificity). Any help would be appreciated. I will have another look at the auc() code next week. Andreas
Interesting, I know this as Youden's Index (also seems to be called Youden's J statistic).
Indeed, this is not straightforward to see. I recommend debugging `projpred:::auc()`, e.g., with:

```r
set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)
auc_val_projpred <- projpred:::auc(cbind(y, mu, 1))
```
Keep in mind that …
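For intuition, here is a sketch of the standard cumulative-sum construction of an ROC curve (my own illustration, not projpred's exact code): if the observations are sorted by decreasing predicted probability, the cumulative sums of the positive-class and negative-class observation weights give, after normalization, the TPR and FPR at every possible threshold; TNR and FNR then follow as 1 - FPR and 1 - TPR.

```r
# Sketch (not projpred's code): cumulative-sum construction of TPR/FPR/TNR/FNR
# for observations y (0/1), predicted probabilities mu, and observation weights w.
set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)
w <- rep(1, nobs)                         # observation weights (here: all equal)

ord <- order(mu, decreasing = TRUE)       # sweep the threshold from high to low
y_ord <- y[ord]
w_ord <- w[ord]

tp <- cumsum(w_ord * (y_ord == 1))        # cumulative true-positive weight
fp <- cumsum(w_ord * (y_ord == 0))        # cumulative false-positive weight

tpr <- tp / sum(w_ord * (y_ord == 1))     # sensitivity at each threshold
fpr <- fp / sum(w_ord * (y_ord == 0))     # 1 - specificity at each threshold
tnr <- 1 - fpr
fnr <- 1 - tpr

# Trapezoidal AUC from the cumulative sums; TSS / Youden's J maximized over thresholds:
auc_cumsum <- sum(diff(c(0, fpr)) * (c(0, head(tpr, -1)) + tpr) / 2)
tss_max <- max(tpr + tnr - 1)
c(auc = auc_cumsum, tss_max = tss_max)
```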
Thanks a lot for the suggestion, Frank. I haven't had the chance to look into this yet. I have been busy trying to get cv_varsel() to work with stratified K-fold cross-validation and a binomial response variable, as discussed above. I believe this is worth sharing in case other users encounter the same problem, and this issue seems to be the appropriate place given its title. I am using CV stratified by y to ensure that the few presences in my data are distributed evenly across folds. TL;DR: …
Here is a reproducible example, following the reprex you (Frank) used in #160, but tweaked slightly to …
Using a Bernoulli distribution would be the most appropriate, given that I am using presence-absence data.
If I instead use a binomial family, the appropriate formula specification in brms would include trials(1). Using trials(1), however, throws the error: …
Currently, the only way to make this approach work seems to be omitting trials(1) and living with the warning from brms: …
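For concreteness, the two family specifications being contrasted would look roughly like this (a sketch with a hypothetical data frame `dat` and predictor `x`, not the original reprex):

```r
library(brms)

# Hypothetical presence-absence data with a single predictor x.
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- rbinom(200, size = 1, prob = plogis(-2 + dat$x))

# Bernoulli family: no trials() term needed for a 0/1 response.
fit_bern <- brm(y ~ x, data = dat, family = bernoulli())

# Binomial family with one trial per row: the number of trials is supplied
# via trials() on the left-hand side of the formula.
fit_binom <- brm(y | trials(1) ~ x, data = dat, family = binomial())
```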
Just thought I would make you aware of this "issue" and provide a working example for others interested in using stratified CV with a binomial classifier. Cheers
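As a rough illustration of the stratification step itself (my own sketch, not the original reprex), loo::kfold_split_stratified() can generate fold indices that keep the class balance roughly constant across folds:

```r
library(loo)

# Hypothetical 0/1 response with few presences (severe class imbalance).
set.seed(42)
y <- rbinom(2000, size = 1, prob = 0.02)

# Fold indices stratified by y: each of the K folds receives roughly the same
# (small) number of presences.
K <- 5
folds <- loo::kfold_split_stratified(K = K, x = y)

# Check the class balance per fold.
table(fold = folds, y = y)

# These fold indices can then be passed on to the K-fold CV machinery,
# e.g. via the folds argument of brms::kfold() for a (hypothetical) brmsfit.
```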
As Frank pointed out in #352, the bernoulli() distribution can be used by relying on brms:::get_refmodel.brmsfit() instead of init_refmodel().
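In practice this could look roughly as follows (a sketch, assuming `fit` is a hypothetical brmsfit with family = bernoulli()):

```r
library(projpred)

# `fit` is a hypothetical brmsfit with family = bernoulli().
# get_refmodel() dispatches to brms:::get_refmodel.brmsfit(), so no manual
# init_refmodel() call is needed.
refm <- get_refmodel(fit)

# K-fold cross-validated variable selection on the reference model.
cvvs <- cv_varsel(refm, cv_method = "kfold", K = 5)
```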
Yes, thank you for updating the issue here, too. And for those stumbling across the issue here, I want to point out that …
Yes, this is related to the fact that …
Stratifying K-fold CV by the response variable seems a bit odd to me; I think stratification is usually meant to be done by a predictor variable. But I haven't thought this through yet, and it might be a valid procedure. In any case, this is a question related to the loo package, not to projpred.
Hi,
I am working on a model with highly imbalanced data, which originates from few (as few as 30) observations of species presences and a few thousand randomly selected background observations/pseudo-absences. I would like to use projpred to select the most relevant predictor variables from a pool of about 100, but am unsure whether any of the currently implemented varsel statistics deal well with severe class imbalance. The Brier (skill) score and the precision-recall AUC could be useful evaluation statistics and were suggested by Aki Vehtari in issue #25, but it seems they have not been implemented.
Since I would also like to cross-validate my variable selection, I was wondering whether it is possible to somehow stratify the cross-validation procedure to ensure comparable class imbalances across folds. In the worst case, some folds will only have (pseudo-)absences and no presences, and hence fail.
Thanks so much for your help.
Cheers
Andy