Major enhancements in the docs (#233)

* Update roxygen2 version.
* Consistently use `#'` (not `##'`) for roxygen2 comments.
* Speed up the examples in the documentation (and minor other improvements in the docs).
* Major enhancements in the documentation.
* Move documentation for `extract_model_data` to section "Details".
* Enhance documentation (continued).
* Fix issue #190 (note in documentation concerning interactions in L1 search).
* Improve formatting in the `CITATION` file (make it more consistent).
* Enhance documentation (continued).
* Add `print()` to examples (otherwise, only output from the last line will be printed).
* Add documentation for `as.matrix.projection()`.
* Show usage of the `posterior` package (in the `as.matrix.projection()` example).
* Argument `regul` is only relevant for submodels which are GLMs. Unify the documentation for `regul` (two occurrences).
* Show usage of **bayesplot** in the example for `as.matrix.projection()`.
* Enhance documentation (continued; `d_test`).
* Use `@inheritParams varsel` instead of `@template args-vsel`.
* Use a consistent (and enhanced) documentation for arguments `seed` and `.seed`.
* Enhance documentation (continued; fix bugs).
* Use roxygen2 syntax also in a non-roxygen2 comment (to avoid inconsistencies in case that comment is turned into a roxygen2 comment).
* Use roxygen2 syntax also in an invisible documentation (to avoid inconsistencies in case that documentation is turned on).
* In `DESCRIPTION`, put the `Package:` field at the beginning because some search routines require this.
* Enhance the `Description:` field in `DESCRIPTION`.
* Re-order the `DESCRIPTION` fields (affects the ordering in the PDF manual).
* Use Markdown syntax consistently.
* Resolve minor inconsistencies in the documentation.
* Resolve minor inconsistencies regarding the abbreviation "CV".
* Enhance documentation (continued; indentation and other formatting).
* Enhance documentation (continued; `man-roxygen/args-newdata.R`).
* Enhance documentation (continued; mainly `refmodel-init-get`; this probably also solves issue #156).
* Enhance documentation (continued; `predict.refmodel()`).
* Enhance documentation (continued; mainly `project()`; this probably also solves a part of issue #154 (namely the documentation) for now).
* Document the now consistent behavior of `suggest_size()` (see issue #164).
* Minor wording improvements.
* Enhance documentation (continued; `projection-linpred-predict`).
* Rename documentation tag `projection-linpred-predict` to `pred-projection`.
* Enhance documentation (continued; `pred-projection`).
* Enhance documentation (continued; mainly `varsel()`).
* Enhance documentation (continued; mainly `cv_varsel()`).
* Use a consistent language for "response" and "predictors".
* Minor improvements in `data.R`.
* Minor improvement for `acc`/`pctcorr`.
* Minor wording improvement.
* Make journal volume numbers bold (as in other R docs).
* Add references for PSIS-LOO CV.
* Fix a typo in a reference.
* Examples: Add a note concerning `chains = 2, iter = 500`.
* Add an example for a custom reference model via `init_refmodel()` (which partly solves issue #125) and perform minor improvements in related example code.
* `DESCRIPTION`: Add the `Date:` field (important for the `CITATION` file).
* Fix a typo in the docs (`as.matrix.projection()` example).
* Minor wording improvement in the docs (`as.matrix.projection()` example).
* Docs: Remove "[stat<...>]" for arXiv links, according to the CRAN convention.
* Docs: Add a comment concerning the relevance of `ndraws<_pred>` and `nclusters<_pred>` in case of `cv_search = FALSE`.
* Docs: Use the terms "search step" and "evaluation step" (as a repeating pattern, this should be easier to recognize for readers).
* Docs: Avoid the `@details [...] Notes: <list>` construction (looks strange in the final documentation). Unfortunately, `@note` can't be used since otherwise, `cv_varsel()`'s documentation couldn't inherit that section from `varsel()`'s documentation.
* There are no Gaussian-only `stats` (at least it isn't implemented this way). Therefore, adapt the documentation.
* Docs: Add more details concerning argument `method`.
* Minor wording improvements in the general package documentation (help topic `projpred-package`).
* References in the docs: Replace "doi:[<...>](<...>)" by "DOI: [<...>](<...>)". In the docs, this does not seem to be necessary, but in the vignette, the former way caused the DOI hyperlinks to be rendered incorrectly.
* Docs: Handle section "References" consistently.
* Replace "arxiv" by "arXiv".
* Docs: Replace "search step" and "evaluation step" by "search part" and "evaluation part", respectively. The reason is a possible confusion of the term "search step" with the steps of a stepwise variable selection, see e.g. `?step`. The second reason is that the term "step" might suggest an EM-like alternation of steps (which is not the case here, since the search is completed first and only after that, the evaluation takes place).
* Minor wording improvements in the general package documentation (available at ``?`projpred-package` ``).
* Fix a reflow bug in the general package documentation (available at ``?`projpred-package` ``).
* `DESCRIPTION`: Minor cleaning.
* `DESCRIPTION`: Reflow to margin width 80.
* Activate the examples in Travis CI checks.
stan-dev · Oct 11, 2021 · dd4bc36
1 parent 7e67eed
commit dd4bc36

Showing 46 changed files with 2,347 additions and 1,830 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -5,7 +5,7 @@ warnings_are_errors: false
 
 latex: true
 r_build_args: '--no-build-vignettes'
-r_check_args: '--ignore-vignettes --no-examples'
+r_check_args: '--ignore-vignettes'
 
 r_github_packages:
   - jimhester/covr
@@ -30,7 +30,7 @@ before_install:
 
 script:
   - travis_wait 120 R CMD build . --no-build-vignettes
-  - travis_wait 120 R CMD check projpred*tar.gz --ignore-vignettes --no-examples
+  - travis_wait 120 R CMD check projpred*tar.gz --ignore-vignettes
 
 after_success:
 - travis_wait 120 Rscript -e 'covr::codecov()'
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,10 +1,13 @@
-Encoding: UTF-8
 Package: projpred
+Encoding: UTF-8
 Title: Projection Predictive Feature Selection
 Version: 2.0.5.9000
-Authors@R: c(person("Juho", "Piironen", role = c("aut"), email = "[email protected]"),
+Date: 2021-09-21
+Authors@R: c(person("Juho", "Piironen", role = c("aut"),
+                    email = "[email protected]"),
              person("Markus", "Paasiniemi", role = "aut"),
-             person("Alejandro", "Catalina", role = c("cre", "aut"), email="[email protected]"),
+             person("Alejandro", "Catalina", role = c("cre", "aut"),
+                    email = "[email protected]"),
              person("Aki", "Vehtari", role = "aut"),
              person("Jonah", "Gabry", role = "ctb"),
              person("Marco", "Colombo", role = "ctb"),
@@ -15,12 +18,15 @@ Maintainer:
     Alejandro Catalina <[email protected]>
 Description:
     Performs projection predictive feature selection for generalized linear
-    models and generalized linear and additive multilevel models
-    (see, Piironen, Paasiniemi and Vehtari, 2020, <https://projecteuclid.org/euclid.ejs/1589335310>,
-    Catalina, Bürkner and Vehtari, 2020, <arXiv:2010.06994>).
-    The package is compatible with the 'rstanarm' and 'brms' packages, but other
-    reference models can also be used. See the package vignette for more
-    information and examples.
+    models and generalized linear and additive multilevel models (see Piironen,
+    Paasiniemi and Vehtari, 2020, <doi:10.1214/20-EJS1711>; Catalina, Bürkner
+    and Vehtari, 2020, <arXiv:2010.06994>). The package is compatible with the
+    'rstanarm' and 'brms' packages, but other reference models can also be used.
+    See the documentation as well as the package vignettes for more information
+    and examples.
+License: GPL-3
+URL: https://mc-stan.org/projpred/, https://discourse.mc-stan.org
+BugReports: https://github.com/stan-dev/projpred/issues/
 Depends:
     R (>= 3.5.0)
 Imports:
@@ -36,11 +42,6 @@ Imports:
     mgcv,
     gamm4,
     rlang
-LinkingTo: Rcpp, RcppArmadillo
-License: GPL-3
-LazyData: TRUE
-Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.1.1
 Suggests:
     rstanarm,
     brms,
@@ -49,7 +50,10 @@ Suggests:
     rmarkdown,
     glmnet,
     bayesplot (>= 1.5.0),
-    optimx
+    optimx,
+    posterior
+LinkingTo: Rcpp, RcppArmadillo
+LazyData: TRUE
+Roxygen: list(markdown = TRUE)
+RoxygenNote: 7.1.2
 VignetteBuilder: knitr
-URL: https://mc-stan.org/projpred/, https://discourse.mc-stan.org
-BugReports: https://github.com/stan-dev/projpred/issues/
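The commit message notes that the `Package:` field was moved to the top of `DESCRIPTION`. One way to check the field order is sketched below in base R with `read.dcf()`. The temporary file and its abbreviated contents are illustrative assumptions, not the package's full `DESCRIPTION`, and the sketch says nothing about which search routines rely on the ordering.

```r
# Illustrative, abbreviated DESCRIPTION content (an assumption for this
# sketch; the real file has many more fields).
desc_lines <- c(
  "Package: projpred",
  "Encoding: UTF-8",
  "Title: Projection Predictive Feature Selection",
  "Version: 2.0.5.9000",
  "Date: 2021-09-21"
)
desc_file <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() returns a character matrix with one column per field.
desc <- read.dcf(desc_file)
field_names <- colnames(desc)

# Check that `Package` is the first field, as after this commit.
package_is_first <- identical(field_names[1], "Package")
print(field_names)
print(package_is_first)

# Clean up the temporary file.
unlink(desc_file)
```

The same check could be pointed at a real `DESCRIPTION` file by passing its path to `read.dcf()` instead of the temporary file.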
diff --git a/R/cv_varsel.R b/R/cv_varsel.R
@@ -1,59 +1,91 @@
-#' Cross-validated variable selection (varsel)
+#' Variable selection with cross-validation
 #'
-#' Perform cross-validation for the projective variable selection for a
-#' generalized linear model or generalized lienar and additive multilevel
-#' models.
+#' Perform the projection predictive variable selection for (G)LMs, (G)LMMs,
+#' (G)AMs, and (G)AMMs. This variable selection consists of a *search* part and
+#' an *evaluation* part. The search part determines the solution path, i.e., the
+#' best submodel for each number of predictor terms (model size). The evaluation
+#' part determines the predictive performance of the submodels along the
+#' solution path. In contrast to [varsel()], [cv_varsel()] performs a
+#' cross-validation (CV) by running the search part with the training data of
+#' each CV fold separately (an exception is explained in section "Note" below)
+#' and running the evaluation part on the corresponding test set of each CV
+#' fold.
 #'
-#' @name cv_varsel
-#'
-#' @template args-vsel
-#' @param cv_method The cross-validation method, either 'LOO' or 'kfold'.
-#'   Default is 'LOO'.
-#' @param nloo Number of observations used to compute the LOO validation
-#'   (anything between 1 and the total number of observations). Smaller values
-#'   lead to faster computation but higher uncertainty (larger errorbars) in the
-#'   accuracy estimation. Default is to use all observations, but for faster
-#'   experimentation, one can set this to a small value such as 100. Only
-#'   applicable if \code{cv_method = 'LOO'}.
-#' @param K Number of folds in the K-fold cross validation. Default is 5 for
-#'   genuine reference models and 10 for datafits (that is, for penalized
+#' @inheritParams varsel
+#' @param cv_method The CV method, either `"LOO"` or `"kfold"`. In the `"LOO"`
+#'   case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV)
+#'   is performed, which avoids refitting the reference model `nloo` times (in
+#'   contrast to a standard LOO CV). In the `"kfold"` case, a \eqn{K}-fold CV is
+#'   performed.
+#' @param nloo Only relevant if `cv_method == "LOO"`. Number of subsampled LOO
+#'   CV folds, i.e., number of observations used for the LOO CV (anything
+#'   between 1 and the original number of observations). Smaller values lead to
+#'   faster computation but higher uncertainty in the evaluation part. If
+#'   `NULL`, all observations are used, but for faster experimentation, one can
+#'   set this to a smaller value.
+#' @param K Only relevant if `cv_method == "kfold"`. Number of folds in the
+#'   \eqn{K}-fold CV. If `NULL`, then `5` is used for genuine reference models
+#'   (i.e., of class `refmodel`) and `10` for `datafit`s (that is, for penalized
 #'   maximum likelihood estimation).
-#' @param validate_search Whether to cross-validate also the selection process,
-#'   that is, whether to perform selection separately for each fold. Default is
-#'   TRUE and we strongly recommend not setting this to FALSE, because this is
-#'   known to bias the accuracy estimates for the selected submodels. However,
-#'   setting this to FALSE can sometimes be useful because comparing the results
-#'   to the case where this parameter is TRUE gives idea how strongly the
-#'   feature selection is (over)fitted to the data (the difference corresponds
-#'   to the search degrees of freedom or the effective number of parameters
-#'   introduced by the selectin process).
-#' @param seed Random seed used in the subsampling LOO. By default uses a fixed
-#'   seed.
+#' @param validate_search Only relevant if `cv_method == "LOO"`. A single
+#'   logical value indicating whether to cross-validate also the search part,
+#'   i.e., whether to run the search separately for each CV fold (`TRUE`) or not
+#'   (`FALSE`). We strongly do not recommend setting this to `FALSE`, because
+#'   this is known to bias the predictive performance estimates of the selected
+#'   submodels. However, setting this to `FALSE` can sometimes be useful because
+#'   comparing the results to the case where this argument is `TRUE` gives an
+#'   idea of how strongly the variable selection is (over-)fitted to the data
+#'   (the difference corresponds to the search degrees of freedom or the
+#'   effective number of parameters introduced by the search).
+#' @param seed Pseudorandom number generation (PRNG) seed by which the same
+#'   results can be obtained again if needed. If `NULL`, no seed is set and
+#'   therefore, the results are not reproducible. See [set.seed()] for details.
+#'   Here, this seed is used for clustering the reference model's posterior
+#'   draws (if `!is.null(nclusters)`), for subsampling LOO CV folds (if `nloo`
+#'   is smaller than the number of observations), and for sampling the folds in
+#'   K-fold CV.
 #'
-#' @details Using less draws or clusters in \code{ndraws}, \code{nclusters},
-#'   \code{nclusters_pred}, or \code{ndraws_pred} than posterior draws in the
-#'   reference model may result in slightly inaccurate projection performance.
-#'   Increasing these arguments linearly affects the computation time.
+#' @inherit varsel details return
 #'
-#' @return An object of type \code{vsel} that contains information about the
-#'   feature selection. The fields are not meant to be accessed directly by the
-#'   user but instead via the helper functions (see the vignettes or type
-#'   ?projpred to see the main functions in the package.)
+#' @note The case `cv_method == "LOO" && !validate_search` constitutes an
+#'   exception where the search part is not cross-validated. In that case, the
+#'   evaluation part is based on a PSIS-LOO CV.
 #'
-#' @examples
-#' \donttest{
-#' if (requireNamespace('rstanarm', quietly=TRUE)) {
-#'   ### Usage with stanreg objects
-#'   n <- 30
-#'   d <- 5
-#'   x <- matrix(rnorm(n*d), nrow=n)
-#'   y <- x[,1] + 0.5*rnorm(n)
-#'   data <- data.frame(x,y)
-#'   fit <- rstanarm::stan_glm(y ~ X1 + X2 + X3 + X4 + X5, gaussian(),
-#'      data=data, chains=2, iter=500)
-#'   cvs <- cv_varsel(fit)
-#'   plot(cvs)
-#' }
+#' @references
+#'
+#' Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model
+#' evaluation using leave-one-out cross-validation and WAIC. *Statistics and
+#' Computing*, **27**(5), 1413-1432. DOI:
+#' [10.1007/s11222-016-9696-4](https://doi.org/10.1007/s11222-016-9696-4).
+#'
+#' Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2021). Pareto
+#' smoothed importance sampling. *arXiv:1507.02646*. URL:
+#' <https://arxiv.org/abs/1507.02646>.
+#'
+#' @seealso [varsel()]
+#'
+#' @examplesIf identical(Sys.getenv("RUN_EX"), "true")
+#' # Note: The code from this example is not executed when called via example().
+#' # To execute it, you have to copy and paste it manually to the console.
+#' if (requireNamespace("rstanarm", quietly = TRUE)) {
+#'   # Data:
+#'   dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
+#'
+#'   # The "stanreg" fit which will be used as the reference model (with small
+#'   # values for `chains` and `iter`, but only for technical reasons in this
+#'   # example; this is not recommended in general):
+#'   fit <- rstanarm::stan_glm(
+#'     y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
+#'     QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
+#'   )
+#'
+#'   # Variable selection with cross-validation (with small values
+#'   # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
+#'   # sake of speed in this example; this is not recommended in general):
+#'   cvvs <- cv_varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
+#'                     seed = 5555)
+#'   # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
+#'   # and `?solution_terms.vsel` for possible post-processing functions.
 #' }
 #'
 #' @export
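
The new `seed` documentation above says that the seed is used, among other things, for sampling the folds in K-fold CV, and that `NULL` means no seed is set. The idea can be sketched in base R as follows. The helper `sample_folds()` is hypothetical and written only for illustration; it is not projpred's internal fold-sampling code.

```r
# Hypothetical helper for illustration only; NOT projpred's internal code.
# Assigns each of `n` observations to one of `K` folds of near-equal size.
sample_folds <- function(n, K, seed = NULL) {
  # Mirror the documented convention: `NULL` means that no seed is set.
  if (!is.null(seed)) set.seed(seed)
  # Build balanced fold labels 1, ..., K and shuffle them via an index
  # permutation.
  labels <- rep_len(seq_len(K), length.out = n)
  labels[sample.int(n)]
}

folds_a <- sample_folds(n = 100, K = 5, seed = 5555)
folds_b <- sample_folds(n = 100, K = 5, seed = 5555)

# Same seed, same partition:
print(identical(folds_a, folds_b))
# Number of observations per fold:
print(table(folds_a))
```

Note that `set.seed()` changes the global PRNG state as a side effect, so code that runs after such a call is affected as well.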

diff --git a/R/data.R b/R/data.R
@@ -1,41 +1,42 @@
-#' Binomial toy example.
+#' Binomial toy example
 #'
 #' @format A simulated classification dataset containing 100 observations.
 #' \describe{
-#'   \item{y}{target, 0 or 1.}
-#'   \item{x}{features, 30 in total.}
+#'   \item{y}{response, 0 or 1.}
+#'   \item{x}{predictors, 30 in total.}
 #' }
-#' @source \url{http://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData}
+#' @source <https://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData>
 "df_binom"
 
-#' Gaussian toy example.
+#' Gaussian toy example
 #'
 #' @format A simulated regression dataset containing 100 observations.
 #' \describe{
-#'   \item{y}{target, real-valued.}
-#'   \item{x}{features, 20 in total. Mean and sd approximately 0 and 1.}
+#'   \item{y}{response, real-valued.}
+#'   \item{x}{predictors, 20 in total. Mean and SD are approximately 0 and 1,
+#'   respectively.}
 #' }
-#' @source \url{http://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData}
+#' @source <https://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData>
 "df_gaussian"
 
-#' Mesquite data set.
-#' 
-#' The mesquite bushes yields data set from Gelman
-#' and Hill (2007) (\url{http://www.stat.columbia.edu/~gelman/arm/}).
+#' Mesquite data set
 #'
-#' @format  The outcome variable is the total weight (in grams) of photosynthetic
-#' material as derived from actual harvesting of the bush. The predictor
-#' variables are:
+#' The mesquite bushes yields data set from Gelman and Hill (2007)
+#' (<http://www.stat.columbia.edu/~gelman/arm/>).
+#'
+#' @format The response variable is the total weight (in grams) of
+#'   photosynthetic material as derived from actual harvesting of the bush. The
+#'   predictor variables are:
 #' \describe{
-#' \item{diam1}{diameter of the canopy (the leafy area of the bush)
-#' in meters, measured along the longer axis of the bush.}
-#' \item{diam2}{canopy diameter measured along the shorter axis}
-#' \item{canopy height}{height of the canopy.}
-#' \item{total height}{total height of the bush.}
-#' \item{density}{plant unit density (# of primary stems per plant unit).}
-#' \item{group}{group of measurements (0 for the first group, 1 for the second group)}
+#'   \item{diam1}{diameter of the canopy (the leafy area of the bush) in meters,
+#'   measured along the longer axis of the bush.}
+#'   \item{diam2}{canopy diameter measured along the shorter axis.}
+#'   \item{canopy height}{height of the canopy.}
+#'   \item{total height}{total height of the bush.}
+#'   \item{density}{plant unit density (# of primary stems per plant unit).}
+#'   \item{group}{group of measurements (0 for the first group, 1 for the second
+#'   group).}
 #' }
 #'
-#' @source \url{http://www.stat.columbia.edu/~gelman/arm/examples/}
+#' @source <http://www.stat.columbia.edu/~gelman/arm/examples/mesquite/mesquite.dat>
 "mesquite"
-
diff --git a/R/extend_family.R b/R/extend_family.R
@@ -1,16 +1,21 @@
 # Model-specific helper functions.
 #
-# \code{extend_family(family)} returns a family object augmented with auxiliary
-# functions that
-# are needed for computing KL-divergence, log predictive density, dispersion
-# projection etc.
+# `extend_family(family)` returns a [`family`] object augmented with auxiliary
+# functions that are needed for computing KL-divergence, log predictive density,
+# dispersion projection, etc.
 #
 # Missing: Quasi-families are not implemented. If dis_gamma is the correct shape
 # parameter for projected Gamma regression, everything should be OK for gamma.
 
-#' Add extra fields to the family object.
-#' @param family Family object.
-#' @return Extended family object.
+#' Extend a family
+#'
+#' This function adds elements to a [`family`] object. It is called internally
+#' by [init_refmodel()], so you will rarely need to call it yourself.
+#'
+#' @param family A [`family`] object.
+#'
+#' @return The [`family`] object extended in the way needed by \pkg{projpred}.
+#'
 #' @export
 extend_family <- function(family) {
   if (.has_family_extras(family)) {

diff --git a/R/families.R b/R/families.R
@@ -1,15 +1,14 @@
-#' Extra family objects.
+#' Extra family objects
 #'
-#' Family objects not in the set of default \link[=family]{family}-objects.
+#' Family objects not in the set of default [`family`] objects.
 #'
 #' @name extra-families
 #'
-#' @param link Specification of the link function, as for the default
-#'   \link[=family]{family}-objects.
-#' @param nu Degrees of freedom for the Student-t distribution.
+#' @param link Name of the link function. In contrast to the default [`family`]
+#'   objects, this has to be a character string here.
+#' @param nu Degrees of freedom for the Student-\eqn{t} distribution.
 #'
-#' @return A family object analogous to those described in
-#'   \link[=family]{family}
+#' @return A family object analogous to those described in [`family`].
 #'
 NULL
 

diff --git a/R/helper_formula.R b/R/helper_formula.R
@@ -1,14 +1,16 @@
-#' Utilities to handle formulas for the external user
-#' @name helper_formula
-NULL
-
+#' Break up matrix terms
+#'
 #' Sometimes there can be terms in a formula that refer to a matrix instead of a
-#' single predictor. Because we can handle search_terms of predictors, this
-#' function breaks the matrix term into individual predictors to handle
-#' separately, as that is probably the intention of the user.
-#' @param formula A formula for a valid model.
-#' @param data The original data frame with a matrix as predictor.
-#' @return a  list containing the expanded formula and the expanded data frame.
+#' single predictor. This function breaks up the matrix term into individual
+#' predictors to handle separately, as that is probably the intention of the
+#' user.
+#'
+#' @param formula A [`formula`] for a valid model.
+#' @param data The original `data.frame` with a matrix as predictor.
+#'
+#' @return A `list` containing the expanded [`formula`] and the expanded
+#'   `data.frame`.
+#'
 #' @export
 break_up_matrix_term <- function(formula, data) {
   formula <- expand_formula(formula, data)