Major enhancements in the docs (#233)
* Update roxygen2 version.

* Consistently use `#'` (not `##'`) for roxygen2 comments.

* Speed up the examples in the documentation (and minor other improvements in the docs).

* Major enhancements in the documentation.

* Move documentation for `extract_model_data` to section "Details".

* Enhance documentation (continued).

* Fix issue #190 (note in documentation concerning interactions in L1 search).

* Improve formatting in the `CITATION` file (make it more consistent).

* Enhance documentation (continued).

* Add `print()` to examples (otherwise, only output from the last line will be printed).

* Add documentation for `as.matrix.projection()`.

* Show usage of the `posterior` package (in the `as.matrix.projection()` example).

* Argument `regul` is only relevant for submodels which are GLMs. Unify the documentation for `regul` (two occurrences).

* Show usage of **bayesplot** in the example for `as.matrix.projection()`.

* Enhance documentation (continued; `d_test`).

* Use `@inheritParams varsel` instead of `@template args-vsel`.

* Use a consistent (and enhanced) documentation for arguments `seed` and `.seed`.

* Enhance documentation (continued; fix bugs).

* Use roxygen2 syntax also in a non-roxygen2 comment (to avoid inconsistencies in case that comment is turned into a roxygen2 comment).

* Use roxygen2 syntax also in an invisible documentation (to avoid inconsistencies in case that documentation is turned on).

* In `DESCRIPTION`, put the `Package:` field at the beginning because some search routines require this.

* Enhance the `Description:` field in `DESCRIPTION`.

* Re-order the `DESCRIPTION` fields (affects the ordering in the PDF manual).

* Use Markdown syntax consistently.

* Resolve minor inconsistencies in the documentation.

* Resolve minor inconsistencies regarding the abbreviation "CV".

* Enhance documentation (continued; indentation and other formatting).

* Enhance documentation (continued; `man-roxygen/args-newdata.R`).

* Enhance documentation (continued; mainly `refmodel-init-get`; this probably also solves issue #156).

* Enhance documentation (continued; `predict.refmodel()`).

* Enhance documentation (continued; mainly `project()`; this probably also solves a part of issue #154 (namely the documentation) for now).

* Document the now consistent behavior of `suggest_size()` (see issue #164).

* Minor wording improvements.

* Enhance documentation (continued; `projection-linpred-predict`).

* Rename documentation tag `projection-linpred-predict` to `pred-projection`.

* Enhance documentation (continued; `pred-projection`).

* Enhance documentation (continued; mainly `varsel()`).

* Enhance documentation (continued; mainly `cv_varsel()`).

* Use a consistent language for "response" and "predictors".

* Minor improvements in `data.R`.

* Minor improvement for `acc`/`pctcorr`.

* Minor wording improvement.

* Make journal volume numbers bold (as in other R docs).

* Add references for PSIS-LOO CV.

* Fix a typo in a reference.

* Examples: Add a note concerning `chains = 2, iter = 500`.

* Add an example for a custom reference model via `init_refmodel()` (which partly solves issue #125) and perform minor improvements in related example code.

* `DESCRIPTION`: Add the `Date:` field (important for the `CITATION` file).

* Fix a typo in the docs (`as.matrix.projection()` example).

* Minor wording improvement in the docs (`as.matrix.projection()` example).

* Docs: Remove "[stat<...>]" for arXiv links, according to the CRAN convention.

* Docs: Add a comment concerning the relevance of `ndraws<_pred>` and `nclusters<_pred>` in case of `cv_search = FALSE`.

* Docs: Use the terms "search step" and "evaluation step" (as a repeating pattern, this should be easier to recognize for readers).

* Docs: Avoid the `@details [...] Notes: <list>` construction (looks strange in the final documentation). Unfortunately, `@note` can't be used since otherwise, `cv_varsel()`'s documentation couldn't inherit that section from `varsel()`'s documentation.

* There are no Gaussian-only `stats` (at least it isn't implemented this way). Therefore, adapt the documentation.

* Docs: Add more details concerning argument `method`.

* Minor wording improvements in the general package documentation (help topic `projpred-package`).

* References in the docs: Replace "doi:[<...>](<...>)" by "DOI: [<...>](<...>)". In the docs, this does not seem to be necessary, but in the vignette, the former way caused the DOI hyperlinks to be rendered incorrectly.

* Docs: Handle section "References" consistently.

* Replace "arxiv" by "arXiv".

* Docs: Replace "search step" and "evaluation step" by "search part" and "evaluation part", respectively. The reason is a possible confusion of the term "search step" with the steps of a stepwise variable selection, see e.g. `?step`. The second reason is that the term "step" might suggest an EM-like alternation of steps (which is not the case here, since the search is completed first and only after that, the evaluation takes place).

* Minor wording improvements in the general package documentation (available at ``?`projpred-package` ``).

* Fix a reflow bug in the general package documentation (available at ``?`projpred-package` ``).

* `DESCRIPTION`: Minor cleaning.

* `DESCRIPTION`: Reflow to margin width 80.

* Activate the examples in Travis CI checks.
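
The `print()` bullet above relies on R's auto-printing rule: when several expressions sit inside one braced block (such as an `if (requireNamespace(...)) { ... }` wrapper around an example), only the value of the last expression is printed automatically. A minimal base-R sketch of that behavior, independent of projpred and using made-up expressions:

```r
# Inside a braced block evaluated at top level, only the value of the last
# expression is auto-printed.
out_implicit <- capture.output(
  if (TRUE) {
    1 + 1
    2 + 2
  }
)

# Explicit print() calls make the intermediate result visible as well.
out_explicit <- capture.output(
  if (TRUE) {
    print(1 + 1)
    print(2 + 2)
  }
)

length(out_implicit) # number of printed lines without print()
length(out_explicit) # number of printed lines with print()
```

Because `print()` returns its argument invisibly, the last `print()` call does not cause the value to be printed a second time.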
fweber144 authored Oct 11, 2021
1 parent 7e67eed commit dd4bc36
Showing 46 changed files with 2,347 additions and 1,830 deletions.
4 changes: 2 additions & 2 deletions .travis.yml
@@ -5,7 +5,7 @@ warnings_are_errors: false

 latex: true
 r_build_args: '--no-build-vignettes'
-r_check_args: '--ignore-vignettes --no-examples'
+r_check_args: '--ignore-vignettes'

 r_github_packages:
 - jimhester/covr
@@ -30,7 +30,7 @@ before_install:

 script:
 - travis_wait 120 R CMD build . --no-build-vignettes
-- travis_wait 120 R CMD check projpred*tar.gz --ignore-vignettes --no-examples
+- travis_wait 120 R CMD check projpred*tar.gz --ignore-vignettes

 after_success:
 - travis_wait 120 Rscript -e 'covr::codecov()'
38 changes: 21 additions & 17 deletions DESCRIPTION
@@ -1,10 +1,13 @@
-Encoding: UTF-8
 Package: projpred
+Encoding: UTF-8
 Title: Projection Predictive Feature Selection
 Version: 2.0.5.9000
-Authors@R: c(person("Juho", "Piironen", role = c("aut"), email = "[email protected]"),
+Date: 2021-09-21
+Authors@R: c(person("Juho", "Piironen", role = c("aut"),
+email = "[email protected]"),
 person("Markus", "Paasiniemi", role = "aut"),
-person("Alejandro", "Catalina", role = c("cre", "aut"), email="[email protected]"),
+person("Alejandro", "Catalina", role = c("cre", "aut"),
+email = "[email protected]"),
 person("Aki", "Vehtari", role = "aut"),
 person("Jonah", "Gabry", role = "ctb"),
 person("Marco", "Colombo", role = "ctb"),
@@ -15,12 +18,15 @@ Maintainer:
 Alejandro Catalina <[email protected]>
 Description:
 Performs projection predictive feature selection for generalized linear
-models and generalized linear and additive multilevel models
-(see, Piironen, Paasiniemi and Vehtari, 2020, <https://projecteuclid.org/euclid.ejs/1589335310>,
-Catalina, Bürkner and Vehtari, 2020, <arXiv:2010.06994>).
-The package is compatible with the 'rstanarm' and 'brms' packages, but other
-reference models can also be used. See the package vignette for more
-information and examples.
+models and generalized linear and additive multilevel models (see Piironen,
+Paasiniemi and Vehtari, 2020, <doi:10.1214/20-EJS1711>; Catalina, Bürkner
+and Vehtari, 2020, <arXiv:2010.06994>). The package is compatible with the
+'rstanarm' and 'brms' packages, but other reference models can also be used.
+See the documentation as well as the package vignettes for more information
+and examples.
+License: GPL-3
+URL: https://mc-stan.org/projpred/, https://discourse.mc-stan.org
+BugReports: https://github.com/stan-dev/projpred/issues/
 Depends:
 R (>= 3.5.0)
 Imports:
@@ -36,11 +42,6 @@ Imports:
 mgcv,
 gamm4,
 rlang
-LinkingTo: Rcpp, RcppArmadillo
-License: GPL-3
-LazyData: TRUE
-Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.1.1
 Suggests:
 rstanarm,
 brms,
@@ -49,7 +50,10 @@ Suggests:
 rmarkdown,
 glmnet,
 bayesplot (>= 1.5.0),
-optimx
+optimx,
+posterior
+LinkingTo: Rcpp, RcppArmadillo
+LazyData: TRUE
+Roxygen: list(markdown = TRUE)
+RoxygenNote: 7.1.2
 VignetteBuilder: knitr
-URL: https://mc-stan.org/projpred/, https://discourse.mc-stan.org
-BugReports: https://github.com/stan-dev/projpred/issues/
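
The commit message notes that the new `Date:` field matters for the `CITATION` file. One common pattern, sketched here under the assumption that the citation year is derived from that field (projpred's actual `CITATION` file is not part of the visible diff), is to read the date from the package metadata. While a real `CITATION` file is evaluated, R supplies a `meta` object; the list below merely stands in for it, and `cit_year` and `cit_note` are illustrative names.

```r
# Stand-in for the `meta` object that R makes available while evaluating a
# package's CITATION file (values copied from the DESCRIPTION diff).
meta <- list(
  Package = "projpred",
  Version = "2.0.5.9000",
  Date = "2021-09-21"
)

# Derive the citation year from the `Date:` field.
cit_year <- format(as.Date(meta$Date), "%Y")

# Assemble a note string as it might appear in a citation entry.
cit_note <- sprintf("R package version %s", meta$Version)

cit_year
cit_note
```

Without a `Date:` field, `meta$Date` is `NULL`, so a year derived this way would not be available.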
132 changes: 82 additions & 50 deletions R/cv_varsel.R
@@ -1,59 +1,91 @@
#' Cross-validated variable selection (varsel)
#' Variable selection with cross-validation
#'
#' Perform cross-validation for the projective variable selection for a
#' generalized linear model or generalized lienar and additive multilevel
#' models.
#' Perform the projection predictive variable selection for (G)LMs, (G)LMMs,
#' (G)AMs, and (G)AMMs. This variable selection consists of a *search* part and
#' an *evaluation* part. The search part determines the solution path, i.e., the
#' best submodel for each number of predictor terms (model size). The evaluation
#' part determines the predictive performance of the submodels along the
#' solution path. In contrast to [varsel()], [cv_varsel()] performs a
#' cross-validation (CV) by running the search part with the training data of
#' each CV fold separately (an exception is explained in section "Note" below)
#' and running the evaluation part on the corresponding test set of each CV
#' fold.
#'
#' @name cv_varsel
#'
#' @template args-vsel
#' @param cv_method The cross-validation method, either 'LOO' or 'kfold'.
#' Default is 'LOO'.
#' @param nloo Number of observations used to compute the LOO validation
#' (anything between 1 and the total number of observations). Smaller values
#' lead to faster computation but higher uncertainty (larger errorbars) in the
#' accuracy estimation. Default is to use all observations, but for faster
#' experimentation, one can set this to a small value such as 100. Only
#' applicable if \code{cv_method = 'LOO'}.
#' @param K Number of folds in the K-fold cross validation. Default is 5 for
#' genuine reference models and 10 for datafits (that is, for penalized
#' @inheritParams varsel
#' @param cv_method The CV method, either `"LOO"` or `"kfold"`. In the `"LOO"`
#' case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV)
#' is performed, which avoids refitting the reference model `nloo` times (in
#' contrast to a standard LOO CV). In the `"kfold"` case, a \eqn{K}-fold CV is
#' performed.
#' @param nloo Only relevant if `cv_method == "LOO"`. Number of subsampled LOO
#' CV folds, i.e., number of observations used for the LOO CV (anything
#' between 1 and the original number of observations). Smaller values lead to
#' faster computation but higher uncertainty in the evaluation part. If
#' `NULL`, all observations are used, but for faster experimentation, one can
#' set this to a smaller value.
#' @param K Only relevant if `cv_method == "kfold"`. Number of folds in the
#' \eqn{K}-fold CV. If `NULL`, then `5` is used for genuine reference models
#' (i.e., of class `refmodel`) and `10` for `datafit`s (that is, for penalized
#' maximum likelihood estimation).
#' @param validate_search Whether to cross-validate also the selection process,
#' that is, whether to perform selection separately for each fold. Default is
#' TRUE and we strongly recommend not setting this to FALSE, because this is
#' known to bias the accuracy estimates for the selected submodels. However,
#' setting this to FALSE can sometimes be useful because comparing the results
#' to the case where this parameter is TRUE gives idea how strongly the
#' feature selection is (over)fitted to the data (the difference corresponds
#' to the search degrees of freedom or the effective number of parameters
#' introduced by the selectin process).
#' @param seed Random seed used in the subsampling LOO. By default uses a fixed
#' seed.
#' @param validate_search Only relevant if `cv_method == "LOO"`. A single
#' logical value indicating whether to cross-validate also the search part,
#' i.e., whether to run the search separately for each CV fold (`TRUE`) or not
#' (`FALSE`). We strongly do not recommend setting this to `FALSE`, because
#' this is known to bias the predictive performance estimates of the selected
#' submodels. However, setting this to `FALSE` can sometimes be useful because
#' comparing the results to the case where this argument is `TRUE` gives an
#' idea of how strongly the variable selection is (over-)fitted to the data
#' (the difference corresponds to the search degrees of freedom or the
#' effective number of parameters introduced by the search).
#' @param seed Pseudorandom number generation (PRNG) seed by which the same
#' results can be obtained again if needed. If `NULL`, no seed is set and
#' therefore, the results are not reproducible. See [set.seed()] for details.
#' Here, this seed is used for clustering the reference model's posterior
#' draws (if `!is.null(nclusters)`), for subsampling LOO CV folds (if `nloo`
#' is smaller than the number of observations), and for sampling the folds in
#' K-fold CV.
#'
#' @details Using less draws or clusters in \code{ndraws}, \code{nclusters},
#' \code{nclusters_pred}, or \code{ndraws_pred} than posterior draws in the
#' reference model may result in slightly inaccurate projection performance.
#' Increasing these arguments linearly affects the computation time.
#' @inherit varsel details return
#'
#' @return An object of type \code{vsel} that contains information about the
#' feature selection. The fields are not meant to be accessed directly by the
#' user but instead via the helper functions (see the vignettes or type
#' ?projpred to see the main functions in the package.)
#' @note The case `cv_method == "LOO" && !validate_search` constitutes an
#' exception where the search part is not cross-validated. In that case, the
#' evaluation part is based on a PSIS-LOO CV.
#'
#' @examples
#' \donttest{
#' if (requireNamespace('rstanarm', quietly=TRUE)) {
#' ### Usage with stanreg objects
#' n <- 30
#' d <- 5
#' x <- matrix(rnorm(n*d), nrow=n)
#' y <- x[,1] + 0.5*rnorm(n)
#' data <- data.frame(x,y)
#' fit <- rstanarm::stan_glm(y ~ X1 + X2 + X3 + X4 + X5, gaussian(),
#' data=data, chains=2, iter=500)
#' cvs <- cv_varsel(fit)
#' plot(cvs)
#' }
#' @references
#'
#' Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model
#' evaluation using leave-one-out cross-validation and WAIC. *Statistics and
#' Computing*, **27**(5), 1413-1432. DOI:
#' [10.1007/s11222-016-9696-4](https://doi.org/10.1007/s11222-016-9696-4).
#'
#' Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2021). Pareto
#' smoothed importance sampling. *arXiv:1507.02646*. URL:
#' <https://arxiv.org/abs/1507.02646>.
#'
#' @seealso [varsel()]
#'
#' @examplesIf identical(Sys.getenv("RUN_EX"), "true")
#' # Note: The code from this example is not executed when called via example().
#' # To execute it, you have to copy and paste it manually to the console.
#' if (requireNamespace("rstanarm", quietly = TRUE)) {
#' # Data:
#' dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
#'
#' # The "stanreg" fit which will be used as the reference model (with small
#' # values for `chains` and `iter`, but only for technical reasons in this
#' # example; this is not recommended in general):
#' fit <- rstanarm::stan_glm(
#' y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
#' QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
#' )
#'
#' # Variable selection with cross-validation (with small values
#' # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
#' # sake of speed in this example; this is not recommended in general):
#' cvvs <- cv_varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
#' seed = 5555)
#' # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
#' # and `?solution_terms.vsel` for possible post-processing functions.
#' }
#'
#' @export
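
The new `seed` documentation states that the seed governs, among other things, how the folds are sampled in K-fold CV. The following base-R sketch illustrates why a fixed seed makes such a fold assignment reproducible. `sample_folds()` is a hypothetical helper written only for this illustration; it is not projpred's internal implementation.

```r
# Hypothetical helper (not part of projpred): randomly assign n observations
# to K folds of (nearly) equal size.
sample_folds <- function(n, K, seed = NULL) {
  if (!is.null(seed)) {
    # Note: this changes the global PRNG state, as set.seed() always does.
    set.seed(seed)
  }
  # Build balanced fold labels 1, ..., K and shuffle them.
  sample(rep_len(seq_len(K), n))
}

folds_a <- sample_folds(n = 100, K = 5, seed = 5555)
folds_b <- sample_folds(n = 100, K = 5, seed = 5555)

# The same seed gives the same fold assignment.
identical(folds_a, folds_b)

# Each fold receives the same number of observations here (100 / 5).
table(folds_a)
```

With `seed = NULL`, no seed is set inside the helper, so the assignment depends on the current PRNG state and is not reproducible in general, which mirrors the behavior described for `seed = NULL` in the documentation above.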
49 changes: 25 additions & 24 deletions R/data.R
@@ -1,41 +1,42 @@
#' Binomial toy example.
#' Binomial toy example
#'
#' @format A simulated classification dataset containing 100 observations.
#' \describe{
#' \item{y}{target, 0 or 1.}
#' \item{x}{features, 30 in total.}
#' \item{y}{response, 0 or 1.}
#' \item{x}{predictors, 30 in total.}
#' }
#' @source \url{http://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData}
#' @source <https://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData>
"df_binom"

#' Gaussian toy example.
#' Gaussian toy example
#'
#' @format A simulated regression dataset containing 100 observations.
#' \describe{
#' \item{y}{target, real-valued.}
#' \item{x}{features, 20 in total. Mean and sd approximately 0 and 1.}
#' \item{y}{response, real-valued.}
#' \item{x}{predictors, 20 in total. Mean and SD are approximately 0 and 1,
#' respectively.}
#' }
#' @source \url{http://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData}
#' @source <https://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData>
"df_gaussian"

#' Mesquite data set.
#'
#' The mesquite bushes yields data set from Gelman
#' and Hill (2007) (\url{http://www.stat.columbia.edu/~gelman/arm/}).
#' Mesquite data set
#'
#' @format The outcome variable is the total weight (in grams) of photosynthetic
#' material as derived from actual harvesting of the bush. The predictor
#' variables are:
#' The mesquite bushes yields data set from Gelman and Hill (2007)
#' (<http://www.stat.columbia.edu/~gelman/arm/>).
#'
#' @format The response variable is the total weight (in grams) of
#' photosynthetic material as derived from actual harvesting of the bush. The
#' predictor variables are:
#' \describe{
#' \item{diam1}{diameter of the canopy (the leafy area of the bush)
#' in meters, measured along the longer axis of the bush.}
#' \item{diam2}{canopy diameter measured along the shorter axis}
#' \item{canopy height}{height of the canopy.}
#' \item{total height}{total height of the bush.}
#' \item{density}{plant unit density (# of primary stems per plant unit).}
#' \item{group}{group of measurements (0 for the first group, 1 for the second group)}
#' \item{diam1}{diameter of the canopy (the leafy area of the bush) in meters,
#' measured along the longer axis of the bush.}
#' \item{diam2}{canopy diameter measured along the shorter axis.}
#' \item{canopy height}{height of the canopy.}
#' \item{total height}{total height of the bush.}
#' \item{density}{plant unit density (# of primary stems per plant unit).}
#' \item{group}{group of measurements (0 for the first group, 1 for the second
#' group).}
#' }
#'
#' @source \url{http://www.stat.columbia.edu/~gelman/arm/examples/}
#' @source <http://www.stat.columbia.edu/~gelman/arm/examples/mesquite/mesquite.dat>
"mesquite"

19 changes: 12 additions & 7 deletions R/extend_family.R
@@ -1,16 +1,21 @@
# Model-specific helper functions.
#
# \code{extend_family(family)} returns a family object augmented with auxiliary
# functions that
# are needed for computing KL-divergence, log predictive density, dispersion
# projection etc.
# `extend_family(family)` returns a [`family`] object augmented with auxiliary
# functions that are needed for computing KL-divergence, log predictive density,
# dispersion projection, etc.
#
# Missing: Quasi-families are not implemented. If dis_gamma is the correct shape
# parameter for projected Gamma regression, everything should be OK for gamma.

#' Add extra fields to the family object.
#' @param family Family object.
#' @return Extended family object.
#' Extend a family
#'
#' This function adds elements to a [`family`] object. It is called internally
#' by [init_refmodel()], so you will rarely need to call it yourself.
#'
#' @param family A [`family`] object.
#'
#' @return The [`family`] object extended in the way needed by \pkg{projpred}.
#'
#' @export
extend_family <- function(family) {
if (.has_family_extras(family)) {
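
As a rough usage sketch of the exported function documented above (assuming projpred is installed; the code is skipped otherwise), one can compare a standard `family` object with its extended version. Which elements are added is not asserted here, since it depends on the installed projpred version.

```r
if (requireNamespace("projpred", quietly = TRUE)) {
  # A standard family object from the stats package.
  fam <- stats::gaussian()

  # Extend it in the way needed by projpred.
  fam_ext <- projpred::extend_family(fam)

  # The result is still a family object ...
  print(inherits(fam_ext, "family"))

  # ... with additional elements (names depend on the projpred version).
  print(setdiff(names(fam_ext), names(fam)))
}
```

As the documentation says, `init_refmodel()` calls this function internally, so a direct call like this is mostly useful for inspection.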
13 changes: 6 additions & 7 deletions R/families.R
@@ -1,15 +1,14 @@
#' Extra family objects.
#' Extra family objects
#'
#' Family objects not in the set of default \link[=family]{family}-objects.
#' Family objects not in the set of default [`family`] objects.
#'
#' @name extra-families
#'
#' @param link Specification of the link function, as for the default
#' \link[=family]{family}-objects.
#' @param nu Degrees of freedom for the Student-t distribution.
#' @param link Name of the link function. In contrast to the default [`family`]
#' objects, this has to be a character string here.
#' @param nu Degrees of freedom for the Student-\eqn{t} distribution.
#'
#' @return A family object analogous to those described in
#' \link[=family]{family}
#' @return A family object analogous to those described in [`family`].
#'
NULL

22 changes: 12 additions & 10 deletions R/helper_formula.R
@@ -1,14 +1,16 @@
#' Utilities to handle formulas for the external user
#' @name helper_formula
NULL

#' Break up matrix terms
#'
#' Sometimes there can be terms in a formula that refer to a matrix instead of a
#' single predictor. Because we can handle search_terms of predictors, this
#' function breaks the matrix term into individual predictors to handle
#' separately, as that is probably the intention of the user.
#' @param formula A formula for a valid model.
#' @param data The original data frame with a matrix as predictor.
#' @return a list containing the expanded formula and the expanded data frame.
#' single predictor. This function breaks up the matrix term into individual
#' predictors to handle separately, as that is probably the intention of the
#' user.
#'
#' @param formula A [`formula`] for a valid model.
#' @param data The original `data.frame` with a matrix as predictor.
#'
#' @return A `list` containing the expanded [`formula`] and the expanded
#' `data.frame`.
#'
#' @export
break_up_matrix_term <- function(formula, data) {
formula <- expand_formula(formula, data)
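
To make the notion of a "matrix term" concrete, the following base-R sketch shows a `data.frame` column that holds a matrix and a manual break-up into individual predictors. This only illustrates the idea described in the documentation above; it does not call `break_up_matrix_term()` and does not reproduce its implementation. All object names are made up for the illustration.

```r
set.seed(1)
n <- 10
X <- matrix(rnorm(n * 2), nrow = n, dimnames = list(NULL, c("a", "b")))
dat <- data.frame(y = rnorm(n))
dat$X <- X # one data.frame column that holds an n x 2 matrix

# The single formula term `X` expands to one coefficient per matrix column.
fit_matrix <- lm(y ~ X, data = dat)
names(coef(fit_matrix))

# Manual break-up: one data.frame column and one formula term per predictor.
dat_flat <- data.frame(y = dat$y, dat$X)
form_flat <- reformulate(colnames(X), response = "y")
fit_flat <- lm(form_flat, data = dat_flat)

# Both formulations describe the same model.
all.equal(unname(coef(fit_matrix)), unname(coef(fit_flat)))
```

Having one term per predictor is what allows a variable selection to consider each matrix column separately instead of treating the whole matrix as a single term.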