Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Varsel stat and stratified cross-validation for highly imbalanced data #328

Open
adietzel opened this issue Jul 5, 2022 · 11 comments
Open
Labels
enhancement Enhancements of existing features, but also new feature requests.

Comments

@adietzel
Copy link

adietzel commented Jul 5, 2022

Hi,

I am working on a model with highly imbalanced data which originates from few (as few as 30 observations) observations of species presences and a few thousand randomly selected background observations/pseudo-absences. I would like to use projpred to select the most relevant predictor variables from a pool of about 100 but am unsure whether any of the currently implemented varsel statistics deal well with severe class imbalance. The Brier (Skill) Score and Precision Recall AUC could be useful evaluation statistics, and were suggested by Aki Vehtari in issue #25 but, it seems, not implemented.

Since I would secondly like to cross-validate my variable selection, I was wondering whether there it is possible to somehow stratify the cross-validation procedure to ensure comparable class imbalances across folds? In the worst case some folds will only have (pseudo-) absences and no presences and hence fail.

Thanks so much for your help.
Cheers

Andy

@fweber144
Copy link
Collaborator

The Brier (Skill) Score and Precision Recall AUC could be useful evaluation statistics, and were suggested by Aki Vehtari in issue #25 but, it seems, not implemented.

As stated in #25, AUC has been implemented by #27 and the Brier score is effectively the MSE (see this comment). I'm currently not sure whether the Brier score (MSE) is actually available for the binomial family, i.e., if an error is thrown when trying to select it. If you experience such an error, please report with a reproducible example and then it should be easy to fix (in that case, I'll re-open this issue).

Since I would secondly like to cross-validate my variable selection, I was wondering whether there it is possible to somehow stratify the cross-validation procedure to ensure comparable class imbalances across folds?

Yes, this is possible via argument cvfits of init_refmodel(). Perhaps loo::kfold_split_grouped() and related functions (see ?loo::kfold_split_grouped) help in constructing the folds. See e.g.

test_that(paste(
"`cvfits` (actually passed to init_refmodel()) works for rstanarm reference",
"models"
), {
or
test_that(paste(
"`cvfits` (actually passed to init_refmodel()) works for brms reference",
"models"
), {

@adietzel
Copy link
Author

adietzel commented Jul 6, 2022

Thanks a lot for your answer, Frank. This helps a lot.

And apologies, I should have been more explicit in my question or problem statement. I am interested in an evaluation metric that focuses on predicting well the minority class and which is less sensitive to being swamped by good predictions of the majority class. Is the implemented AUC the AUROC (Receiver operating characteristic) which assumes that both classes (minority and majority) are important or the AUPRC (Area under the Precision-Recall Curve) which focuses on classifying correctly the minority class?

Or asked differently, which of the currently implemented selection statistics would you recommend for highly imbalanced data, when I am mostly interested in correctly classifying the minority class?

Thanks a lot for your help.
Best
Andreas

@fweber144
Copy link
Collaborator

As far as I understand the source code in lines

projpred/R/misc.R

Lines 29 to 54 in 33047d6

auc <- function(x) {
resp <- x[, 1]
pred <- x[, 2]
weights <- x[, 3]
n <- nrow(x)
ord <- order(pred, decreasing = TRUE)
resp <- resp[ord]
pred <- pred[ord]
weights <- weights[ord]
w0 <- w1 <- weights
w0[resp == 1] <- 0 # true negative weights
w1[resp == 0] <- 0 # true positive weights
cum_w0 <- cumsum(w0)
cum_w1 <- cumsum(w1)
## ignore tied predicted probabilities, keeping only the rightmost one
rightmost.prob <- c(diff(pred) != 0, TRUE)
fpr <- c(0, cum_w0[rightmost.prob]) / cum_w0[n]
tpr <- c(0, cum_w1[rightmost.prob]) / cum_w1[n]
delta_fpr <- c(diff(fpr), 0)
delta_tpr <- c(diff(tpr), 0)
## sum the area of the rectangles that fall completely below the ROC curve
## plus half the area of the rectangles that are cut in two by the curve
return(sum(delta_fpr * tpr) + sum(delta_fpr * delta_tpr) / 2)
}
and lines

projpred/R/summary_funs.R

Lines 232 to 248 in 33047d6

} else if (stat == "auc") {
y <- d_test$y
auc.data <- cbind(y, mu, weights)
if (!is.null(mu.bs)) {
mu.bs[is.na(mu)] <- NA # compute the relative auc using only those points
mu[is.na(mu.bs)] <- NA # for which both mu and mu.bs are non-NA
auc.data.bs <- cbind(y, mu.bs, weights)
value <- auc(auc.data) - auc(auc.data.bs)
value.bootstrap1 <- bootstrap(auc.data, auc, ...)
value.bootstrap2 <- bootstrap(auc.data.bs, auc, ...)
value.se <- sd(value.bootstrap1 - value.bootstrap2, na.rm = TRUE)
} else {
value <- auc(auc.data)
value.bootstrap <- bootstrap(auc.data, auc, ...)
value.se <- sd(value.bootstrap, na.rm = TRUE)
}
}
as well as your explanations, I think it is the former of the two definitions, i.e.

the AUROC (Receiver operating characteristic) which assumes that both classes (minority and majority) are important

If it helps: projpred's projpred:::auc() implementation seems to give the same results as pROC::auc():

set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)
dat <- data.frame(y = y, mu = mu)

( auc_val_projpred <- projpred:::auc(cbind(y, mu, 1)) )
library(pROC)
auc_val <- auc(y ~ mu, data = dat, direction = "<", algorithm = 1)
( auc_val_pROC <- auc_val[seq_along(auc_val)] ) # Indexing only to drop attributes.
stopifnot(isTRUE(all.equal(auc_val_pROC, auc_val_projpred, tolerance = 1e-20)))

(At that occasion, I'm realizing that the comment in line

w0[resp == 1] <- 0 # true negative weights
should read # false positive weights instead of # true negative weights and that I need to check projpred:::auc() in case of a binomial family with > 1 trials.)

@adietzel
Copy link
Author

adietzel commented Jul 7, 2022

Thanks so much for looking further into this.
The AUC implemented in projpred does indeed seem to be the AUROC.

Implementing additional evaluation metrics is likely currently not one of your priorities in further developing projpred but for cases with highly imbalanced data it would be wonderful to, in the future, be able to select metrics like the AUPRC or F1 score, which both focus on predicting correctly the minority class and should be used with highly unbalanced data (see, for instance, Saito & Rehmsmeiner 2015, doi: 10.1371/journal.pone.0118432).

I am not sure how often you and other users of projpred come across cases with highly imbalanced data and how difficult it would be to implement additional metrics in projpred.

Thank you for your help and understanding, and for developing this great project.

Best

Andreas

@fweber144
Copy link
Collaborator

Implementing additional evaluation metrics is likely currently not one of your priorities in further developing projpred but for cases with highly imbalanced data it would be wonderful to, in the future, be able to select metrics like the AUPRC or F1 score, which both focus on predicting correctly the minority class and should be used with highly unbalanced data (see, for instance, Saito & Rehmsmeiner 2015, doi: 10.1371/journal.pone.0118432).

Yes, thank you for these helpful suggestions. I'm re-opening this issue as a feature request for these additional statistics.

@fweber144 fweber144 reopened this Jul 10, 2022
@adietzel
Copy link
Author

Excellent. Thanks!

@adietzel
Copy link
Author

adietzel commented Sep 9, 2022

A short update on this feature request: I have tried to write a new function to calculate the true skill statistic, which is also known as the Peirce score or Hanssen-Kuipers discriminant and copes well with highly imbalanced data. The basic formula is:

tss = tpr + tnr - 1

where tpr = true positive rate and tnr the true negative rate.

Here is a short code snippet that calculates the tss for simulated data

# Simulate data
data <- data.frame(resp = rbinom(100, 1, 0.6),
                   pred = rbinom(100, 1, 0.6),
                   weights = rep(1, 100))

# Calcualte true positive rate
data$tp <- data$resp * data$pred
tpr <- sum(data$tp) / sum(data$resp) # tpr = tp / (tp + fn)
tpr

# Calculate true negative rate
data$tn <- ifelse(data$resp == 0 & data$pred == 0, 1, 0)
tnr <- sum(data$tn) / length(data$resp == 0) # tnr = tn / (fp + tn)
tnr

# Calculate True-Skill Statistic (Peirce score)
tss <- tpr + tnr - 1
tss

I tried to implement this function following the logic of the projpred::auc() function but I have struggled to wrap my head around how the weights are implemented and, specifically, how the cumulative sums of weights are used to calculate the different error rates. The auc() function already calculates the tpr and fpr. If the tnr and fnr were added, a whole host of different scores could be calculated (https://en.wikipedia.org/wiki/Sensitivity_and_specificity).

Any help would be appreciated. I will have another look at the auc() code next week.
Cheers

Andreas

@fweber144
Copy link
Collaborator

tss = tpr + tnr - 1

Interesting, I know this as Youden's Index (also seems to be called Youden's J statistic).

I have struggled to wrap my head around how the weights are implemented and, specifically, how the cumulative sums of weights are used to calculate the different error rates.

Indeed, this is not straightforward to see. I recommend to debug projpred:::auc() line-by-line using a simple example, e.g., the one from #329, slightly modified:

set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)

auc_val_projpred <- projpred:::auc(cbind(y, mu, 1))

The auc() function already calculates the tpr and fpr. If the tnr and fnr were added, a whole host of different scores could be calculated (https://en.wikipedia.org/wiki/Sensitivity_and_specificity).

Keep in mind that projpred:::auc() does not calculate the TPR or FPR, but instead one value per "classification rule", where "classification rule" means to classify all observations as "outcome = 1" for which the predicted probability for "outcome = 1" exceeds (or is equal to) a specific value from mu. (For the AUC, the classification rule is varied by varying a certain threshold, here in terms of mu.) So I'm not sure if this is what you expect/need.

@adietzel
Copy link
Author

Thanks a lot for the suggestion, Frank. I haven't had the chance to look into this yet. I have been busy trying to implement cv_varsel() with stratified kfold cross-validation with a binomial response variable, as discussed above. I believe it is worth sharing in case other users encounter this issue and that this issue is the appropriate issue given its title. I am using stratified (by y) CV to ensure the few presences in my data are distributed equally across folds.

TLDR:

  • bernoulli() not supported by projpred (accordings to init_refmodel)
  • binomial() without specifying trials(1) deprecated in brms but including trials(1) throws error in init_refmodel() stating that function trials() could not be found
  • Work-around to omit trials(1) and live with warnings that this formula specification is deprecated in brms

Here is a reproducible example, following the reprex you (Frank) used in #160 but tweaked it slightly to

  • have a binomial response variable
  • use stratified CV (K = 5, group = "y")

Using a bernoulli distribution would be the most appropriate given that I am using presence-absence data.
init_refmodel() unfortunately throws an error saying that bernoulli is not supported by projpred (version 2.2.1), which is interesting since it's meant to be supported. Perhaps not updated in init_refmodel() yet?

library(brms)
library(projpred)
data("df_gaussian")
mydat <- break_up_matrix_term(y ~ x, data = df_gaussian)$data
mydat <- head(mydat, 40)

mydat$y <- ifelse(mydat$y > 0, 1, 0)

myfit <- brm(
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = bernoulli(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)

myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = myfit$family,
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # work around since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)

If I instead use a binomial, the appropriate formula specification in brms would be:
"y | trials(1) ~ ... "
Skipping trials(1) is deprecated, throws a warning, and might break in future versions?

Using trials(1), however, throws the error:
"Error in trials(1) : could not find function "trials""

myfit <- brm(
  formula = y | trials(1) ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = binomial(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)

myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = myfit$family,
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # work around since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)

Currently, the only way to make this approach work seems to be omitting trials(1) and living with the warning from brms:

myfit <- brm(
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = binomial(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)
table(mydat$y, myfolds)
myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = binomial(),
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # work around since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)
mycvvs <- cv_varsel(
  myrefmod,
  method = "forward",
  cv_method = "kfold",
  K = myK,
  nclusters = 2,
  nclusters_pred = 3,
  nterms_max = 4,
  seed = 86132035
)

Just thought I would make you aware of this "issue" and provide a working example to others interested in using stratified CV with a binomial classifier.

Cheers

@adietzel
Copy link
Author

As Frank pointed out in #352, the bernoulli() distribution can be implemented by using brms:::get_refmodel.brmsfit() instead of init_refmodel().

@fweber144
Copy link
Collaborator

As Frank pointed out in #352, the bernoulli() distribution can be implemented by using brms:::get_refmodel.brmsfit() instead of init_refmodel().

Yes, thank you for updating the issue here, too. And for those stumbling across the issue here, I want to point out that brms:::get_refmodel.brmsfit()'s approach is to convert the brms::bernoulli() family to a binomial() one, so that approach could also be used for a custom reference model created by init_refmodel() if the underlying fit is a brms::bernoulli() one.

binomial() without specifying trials(1) deprecated in brms but including trials(1) throws error in init_refmodel() stating that function trials() could not be found

Yes, this is related to the fact that brms:::get_refmodel.brmsfit() (or the approach used there) needs to be used.

I am using stratified (by y) CV to ensure the few presences in my data are distributed equally across folds.

Stratifying K-fold CV by the response variable seems a bit odd to me. I think it's meant to be stratified by a predictor variable. But I haven't thought this through yet. It might be a valid procedure. In any case, this is a question related to package loo, not projpred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancements of existing features, but also new feature requests.
Projects
None yet
Development

No branches or pull requests

2 participants