From b06bfac24ad857c0d2be554276b9a29f83d896e7 Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Mon, 30 Sep 2024 16:19:49 -0400
Subject: [PATCH 1/7] first sketch of feature

---
 designs/0035-pre-model-gqs.md | 148 ++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 designs/0035-pre-model-gqs.md

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
new file mode 100644
index 0000000..eab9bb9
--- /dev/null
+++ b/designs/0035-pre-model-gqs.md

- Feature Name: Generated Data Block
- Start Date: 09-30-2024

# Summary
[summary]: #summary

This design doc proposes adding a new block to the Stan language called `generated data`. It will appear between the transformed parameters and model blocks. All code in the generated data block is executed with `double` values, like the generated quantities block, and the values it produces are available to the model block.

Here is a sample use case, which uses the generated data block to inject randomness into the sensitivity and specificity values used to estimate the prevalence of a disease from positive test counts.

```stan
data {
  real rho_ss;
  vector[2] loc_ss;
  vector[2] scale_ss;
  int N;
  int n;
}
transformed data {
  cov_matrix[2] Sigma_ss
    = [[scale_ss[1]^2, scale_ss[1] * scale_ss[2] * rho_ss],
       [scale_ss[1] * scale_ss[2] * rho_ss, scale_ss[2]^2]];
}
parameters {
  real<lower=0, upper=1> prevalence;
}
generated data {
  vector[2] sens_spec
    = inv_logit(multi_normal_rng(loc_ss, Sigma_ss));
}
model {
  real sens = sens_spec[1];
  real spec = sens_spec[2];
  real pos_test_prob
    = prevalence * sens + (1 - prevalence) * (1 - spec);
  n ~ binomial(N, pos_test_prob);
  prevalence ~ beta(2, 100);
}
```

# Motivation
[motivation]: #motivation

The motivation is to support three new key features for Stan:

* Multiple imputation of missing data
* Cut-based reasoning for random quantities
* Gibbs sampling discrete parameters


# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Stan will support a new block, `generated data`, that is required to appear between `transformed parameters` and the `model` block.

Variables may be declared in the same way as for other blocks. All data types are allowed, including integers.

Within the `generated data` block, all variables and functions declared in earlier blocks (`functions`, `data`, `transformed data`, `parameters`, and `transformed parameters`) will be available. Variables declared in the `generated data` block will be visible to all later blocks (`model` and `generated quantiries`).

This `generated data` block allows access to all of the standard Stan functions, including random number generator functions ending in `_rng`. It disallows access to the target log density, so the following operations are not available: the target function `target()`, the target increment statement `target += ...;`, the distribution statement `A ~ foo(B);`, and functions ending in `_lp`.

## Example

Here's an example that encodes a normal mixture model by sampling discrete parameters.

```stan
data {
  int N;               // # observations
  vector[N] y;         // observations
}
parameters {
  vector[2] mu;                   // component locations
  vector<lower=0>[2] sigma;       // component scales
  real<lower=0, upper=1> lambda;  // mixture ratio
}
generated data {
  // sample z ~ p(z | y, lambda, mu, sigma)
  array[N] int z;
  for (n in 1:N) {
    real log_z_eq_1 = log(lambda) + normal_lpdf(y[n] | mu[1], sigma[1]);
    real log_z_eq_2 = log1m(lambda) + normal_lpdf(y[n] | mu[2], sigma[2]);
    real log_p = log_z_eq_1 - log_sum_exp(log_z_eq_1, log_z_eq_2);
    z[n] = 2 - bernoulli_rng(exp(log_p));
  }
}
model {
  // target += p(z | lambda, N)
  2 * N - sum(z) ~ binomial(N, lambda);

  // target += p(y | mu, sigma, z)
  y ~ normal(mu[z], sigma[z]);

  // target += p(mu, sigma, lambda)
  mu ~ normal(0, 1);
  sigma ~ exponential(1);
  lambda ~ beta(2, 2);
}
```

This is not an efficient way to code this model, but it's the simplest possible example of using the `generated data` block for Gibbs sampling of discrete parameters (i.e., `z`).


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The `generated data` block will be executed in exactly the same way as the `generated quantities` block, i.e., with double-precision floating point and integer types.

Like all of the other blocks, variables are available as soon as they are declared, and they persist after the block terminates so that they are available in the `model` and `generated quantities` blocks.


# Drawbacks
[drawbacks]: #drawbacks

It will require additional documentation and tests, increasing Stan's long-term maintenance burden.

Using "cut"-based inference does not perform full Bayesian inference, and there's a risk that this will confuse our users.

Using this block for Gibbs sampling will be less efficient than marginalization and may tempt many users to define poorly mixing models.

There will be no way to test if a Gibbs sampler has defined the correct conditional distribution for its variables.

Introducing a `generated data` block opens up a back door approach to converting real numbers to integers. Recall that we do not allow `to_int(real)` to apply to parameters precisely to prevent cutting derivatives and hence inference. This reintroduces a way to do that by applying `to_int(data real)`, because the real values in the `generated data` block will be double-precision floating point variables rather than autodiff variables.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Alternatives

To support cut-based inference and multiple imputation, it is possible to run Stan multiple times with data generated in the transformed data block. This will not be as efficient, in that each imputed value will require Stan to be warmed up and sampled again.

There is no alternative for the Gibbs-based discrete parameter inference.


- What is the impact of not doing this?

Users will find it impossible to do discrete sampling that interacts with the model and will find it very challenging to do imputation.


# Unresolved questions
[unresolved-questions]: #unresolved-questions


# Citations

* Plummer, Martyn. 2015. Cuts in Bayesian graphical models. Statistical Computing 25:37--43.

* Wikipedia. Gibbs Sampling.
https://en.wikipedia.org/wiki/Gibbs_sampling

From 1a409f6cb3926c502f7eba7d3d726214a379062f Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Mon, 4 Nov 2024 16:02:13 -0500
Subject: [PATCH 2/7] updated design doc with examples

---
 designs/0035-pre-model-gqs.md | 388 ++++++++++++++++++++++++++++------
 1 file changed, 326 insertions(+), 62 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index eab9bb9..cfc1ea4 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md
@@ -4,64 +4,53 @@

# Summary
[summary]: #summary

This design doc proposes adding a new block to the Stan language
called `generated data` with a range of proposed uses. The new block
will appear between the transformed parameters and model blocks and be
executed once at the beginning of each iteration using `double` values
for all parameters and transformed parameters. The `generated data`
variables will then be in scope for the rest of the program, but
cannot be modified.

# Motivation
[motivation]: #motivation

The motivation is to support three new key features for Stan:

* Cut-based reasoning for random quantities
* Multiple imputation of missing data
* Gibbs sampling discrete parameters


# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Stan will support a new block, `generated data`, that is required to
appear between `transformed parameters` and the `model` block.
Variables may be declared in the same way as for other blocks. All
data types are allowed, including integers.

Within the `generated data` block, all variables and functions
declared in earlier blocks (`functions`, `data`, `transformed data`,
`parameters`, and `transformed parameters`) will be available.
Variables declared in the `generated data` block will be visible to
all later blocks (`model` and `generated quantiries`).

The `generated data` block allows access to any function not involving
the target density, including random number generators. It disallows
access to the target log density, so the following operations are not
available: the target function `target()`, the target increment statement
`target += ...;`, the distribution statement `A ~ foo(B);`, and
functions ending in `_lp`.


## Gibbs sampling discrete parameters

Here's an example that encodes a normal mixture model by sampling
discrete parameters. The generative process for a data item first
generates a random component $z_n \sim \textrm{bernoulli}(\lambda)$ and then
generates a data item $y_n \sim \textrm{normal}(\mu_{z_n}, \sigma_{z_n}).$

```stan
data {
  int N;               // # observations
  vector[N] y;         // observations
}
parameters {
  vector[2] mu;                   // component locations
  vector<lower=0>[2] sigma;       // component scales
  real<lower=0, upper=1> lambda;  // mixture ratio
}
generated data {
  // sample z ~ p(z | y, lambda, mu, sigma)
  array[N] int z;
  for (n in 1:N) {
    real log_z_eq_1 = log(lambda) + normal_lpdf(y[n] | mu[1], sigma[1]);
    real log_z_eq_2 = log1m(lambda) + normal_lpdf(y[n] | mu[2], sigma[2]);
    real log_p = log_z_eq_1 - log_sum_exp(log_z_eq_1, log_z_eq_2);
    z[n] = 2 - bernoulli_rng(exp(log_p));
  }
}
model {
  2 * N - sum(z) ~ binomial(N, lambda);

  y ~ normal(mu[z], sigma[z]);

  mu ~ normal(0, 1);
  sigma ~ exponential(1);
  lambda ~ beta(2, 2);
}
```

What the `generated data` block does for `z` does not look like the
simple generative model. That's because the `generated data` block
must code up the conditional distribution of $z$ given $y, \lambda,
\mu,$ and $\sigma$. The conditional here requires the same derivation
as marginalization.

This is not an efficient way to code this model, but it's the simplest
possible example of using the `generated data` block for Gibbs
sampling of discrete parameters, such as $z$ in this example. More
useful examples might include variable selection with a spike and slab
prior.

## Block Gibbs for hierarchical models

Neal (2011, p. 143) proposes a scheme whereby HMC is only used for the
low-level parameters of a hierarchical model, with the population
parameters being sampled with block Gibbs. In a simple case, suppose
we have a model with $N$ binary observations in $K$ groups with group
indicator $\text{group}$. The likelihood will be $y_n \sim
\text{bernoulli}(\textrm{logit}^{-1}(\alpha_{\text{group}_n}))$ and
the prior $\alpha_k \sim \textrm{normal}(mu, \sigma)$ with conjugate
priors $\sigma^2 \sim \textrm{invGamma}(a, b)$ and $\mu \sim
\textrm{normal}(0, \sigma)$.

We can use conjugacy to define the posterior for the hyperpriors $\mu,
\sigma^2$ analytically given $\alpha$; the update for $\sigma^2$
integrates $\mu$ out of the joint density, giving

$\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((alpha - \bar{\alpha})^2) / 2 + K \cdot \bar{alpha}^2 / (2 \cdot (K + 1))),$

where $\bar{alpha} = \text{mean}(\alpha)$. The conjgate posterior for
$\mu$ given $\alpha$ and $\sigma^2$ is

$\mu \sim \text{normal}(K \cdot \bar{\alpha} / (K + 1), \sigma /
\sqrt{K + 1}),$

where, as usual, we parameterize the normal distribution by its scale.

With all of this, we can code Neal's recommendation as follows.

```stan
data {
  real<lower=0> a, b;  // inverse-gamma prior on variance of effects
  int K;
  int N;
  array[N] int group;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  vector[K] alpha;
}
generated data {
  real alpha_bar = mean(alpha);
  real sigma_alpha
    = sqrt(inv_gamma_rng(a + K / 2.0,
                         b + 0.5 * dot_self(alpha - alpha_bar)
                           + K * alpha_bar^2 / (2 * (K + 1))));
  real mu_alpha = normal_rng(K * alpha_bar / (K + 1),
                             sigma_alpha / sqrt(K + 1));
}
model {
  alpha ~ normal(mu_alpha, sigma_alpha);
  y ~ bernoulli_logit(alpha[group]);
}
```

The generated data block takes a posterior draw for `sigma_alpha` and
`mu_alpha` conditioned on `alpha`, then HMC/NUTS is only used to
update the low-level paramters `alpha` based on the likelihood and
prior as defined in the model block.

## Cut and injected randomness

Here is a simple use case which uses the generated data block to
generate random sensitivity and specificity values from a population
mean and covariance. The population model is used to generate random
sensitivity and specificity values each iteration, pushing their
uncertainty through the model-based inferences.

```stan
data {
  vector[2] loc_ss;
  cov_matrix[2] Sigma_ss;
  int N;
  int n;
}
parameters {
  real<lower=0, upper=1> prev;
}
generated data {
  vector[2] sens_spec
    = inv_logit(multi_normal_rng(loc_ss, Sigma_ss));
}
model {
  real sens = sens_spec[1];
  real spec = sens_spec[2];
  real pos_test_prob = prev * sens + (1 - prev) * (1 - spec);
  n ~ binomial(N, pos_test_prob);
  prev ~ beta(2, 100);
}
```

This is related to "cut" in Gibbs sampling (Plummer 2015), which
restricts the flow of information during parameter inference. Cut is
popular in pharmacometric modeling when the pharmacokinetic model is
well understood and well specified, but the pharmacodynamic model is
less well understood, because it prevents the dynamics model from
unduly distorting the kinetic model.

The simplest example of cut would assume we have some calibration data
for sensitivity $(N^\text{sens}, n^\text{sens})$ and for specificity
$(N^\text{spec}, N^\test{spec})$, which we can model as binomial, e.g.,
$n^\text{sens} \sim \textrm{binomial}(N^\text{sens}, \textit{sens})$,
where $\textit{sens} \in (0, 1)$ is the sensitivity parameter.

```stan
data {
  int N;
  int n;
  int N_sens;
  int n_sens;
  int N_spec;
  int n_spec;
}
parameters {
  real<lower=0, upper=1> prev;
  real<lower=0, upper=1> sens_cut;
  real<lower=0, upper=1> spec_cut;
}
generated data {
  // cuts inference to sens and spec
  real sens = sens_cut;
  real spec = spec_cut;
}
model {
  // allows inference for cut parameters from calibration data
  n_sens ~ binomial(N_sens, sens_cut);
  n_spec ~ binomial(N_spec, 1 - spec_cut);

  real pos_test_prob = prev * sens + (1 - prev) * (1 - spec);
  n ~ binomial(N, pos_test_prob);
  prev ~ beta(2, 100);
}
```

In this example, `sens_cut` and `spec_cut` work like ordinary
parameters. Their posterior will be defined by their use in the model
block. The variables `sens` and `spec`, which are used to define the
positive test probability and hence estimate `prev`, do not feed
information back to `sens_cut` and `spec_cut`. The values of `sens`
and `spec` get updated once at the beginning of each iteration based
on `sens_cut` and `spec_cut` from the previous iteration. Assigning to
`sens` and to `spec` cuts feedback to `sens_cut` and `spec_cut`.


## Multiple imputation for missing data

One standard approach to multiple imputation is to use an imputation
model on the data, then propagate multiple data sets through to
inference. That is, we start with a data set $y^\text{inc}$ with
missing data, then impute $y^{\text{miss}(n)}$ for multiple $n$ and
define a total of $N$ complete data sets $y^{(n)} = y^\text{inc},
y^{\text{miss}(n)}$. For each of these $N$ data sets, we take
$M$ posterior draws, $\theta^{(n, m)} \sim p(\theta \mid y^{(n)})$. We
then perform standard plug-in Monte Carlo inference with all $N \cdot
M$ draws.

We will assume that two continuous predictors may be missing and
deal with the missingness through a pair of conditional normal
regressions.

```stan
data {
  int N;
  vector[N] x1_obs, x2_obs;
  array[N] int miss1, miss2;
  vector[N] y;
}
parameters {
  real alpha1, alpha2;
  real<lower=0> sigma1, sigma2;
  real beta1, beta2;
  real<lower=0> sigma;
}
generated data {
  vector[N] x1 = x1_obs, x2 = x2_obs;
  for (n in 1:N) {
    if (miss1[n]) {
      x1[n] = normal_rng(alpha1 * x2[n], sigma1);
    }
    if (miss2[n]) {
      x2[n] = normal_rng(alpha2 * x1[n], sigma2);
    }
  }
}
model {
  x1 ~ normal(alpha1 * x2, sigma1);
  x2 ~ normal(alpha2 * x1, sigma2);
  y ~ normal(beta1 * x1 + beta2 * x2, sigma);
}
```

The result will be a multiple imputation, not a coherent joint
Bayesian model. This will work even if the missing data is discrete
or the conditional distributions are not coherent.

## Stochastic gradient descent

The usual model in Stan evaluates the likelihood for the given data
set and a prior. There has, until now, never been a good way to do
stochastic gradient descent over subsampled data. Now we can do
that. This is the simplest possible example. Suppose we start with
this model.

```stan
data {
  int N;
  array[N] int y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(5, 5);
  y ~ bernoulli(theta);
}
```

We can move to stochastic gradient descent by randomly subsampling `y`
in the `generated data` block. As we do this, it's traditional to
scale the likelihood to match the original data size.


```stan
data {
  int N;
  int N_sub;
  array[N] int y;
}
transformed data {
  real N_over_N_sub = N * 1.0 / N_sub;
  simplex[N] unif = rep_vector(1.0 / N, N);
}
parameters {
  real<lower=0, upper=1> theta;
}
generated data {
  array[N_sub] int y_sub;
  for (n in 1:N_sub) {
    y_sub[n] = y[categorical_rng(unif)];
  }
}
model {
  theta ~ beta(5, 5);
  // scale data weight to match population size
  target += N_over_N_sub * bernoulli_lpmf(y_sub | theta);
}
```

If you plug this model into optimization, by default it will apply
quasi-Newton steps using stochastic gradients.


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The `generated data` block will be executed in exactly the same way as
the `generated quantities` block, i.e., with all previous variables
coded as double-precision floating point and integer types.

Like all of the other blocks, variables from previous blocks will be
available in the `generated data` block, and variables from the
`generated data` block will be available in all subsequent blocks.

# Drawbacks
[drawbacks]: #drawbacks

1. It will require additional documentation and tests, increasing
   Stan's long-term maintenance burden.
2. Using "cut"-based inference does not perform full Bayesian
   inference, and there's a risk that this will confuse our users.
3. Using this block for Gibbs sampling will be less efficient than
   marginalization and may tempt many users to define poorly mixing
   models.
4. There is nothing enforcing consistency of the conditional
   distributions when doing Gibbs, so this will be easy to get wrong
   and there won't be an easy way to test that this is right.
5. Introducing a `generated data` block opens up a back door approach to
   converting real numbers to integers, which can be done with
   `to_int(data real)` if the argument is "data" (i.e., a primitive
   rather than autodiff variable).

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

## Alternatives

To support pure uncertainty propagation and non-interacting forms of
cut (i.e., ones that don't use any of the model parameters), we can
use multiple imputation in the traditional way by running multiple
Stan programs in sequence, as sketched below. This is less efficient
and not as general.

There is no alternative for the Gibbs-based discrete parameter
inference.

## Impact of not implementing

Users will find it impossible to do discrete sampling that interacts
with the model and will find it very challenging to do imputation.
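For concreteness, here is a minimal sketch of that traditional
multiple-run alternative, applied to the sensitivity/specificity
example above. It is illustrative only: each run of this program fixes
a single random draw of `sens_spec` in `transformed data`, and the
posterior draws from many such runs are pooled downstream.

```stan
data {
  vector[2] loc_ss;
  cov_matrix[2] Sigma_ss;
  int N;
  int n;
}
transformed data {
  // one draw per program execution; rerun many times and pool draws
  vector[2] sens_spec = inv_logit(multi_normal_rng(loc_ss, Sigma_ss));
}
parameters {
  real<lower=0, upper=1> prev;
}
model {
  real pos_test_prob
    = prev * sens_spec[1] + (1 - prev) * (1 - sens_spec[2]);
  n ~ binomial(N, pos_test_prob);
  prev ~ beta(2, 100);
}
```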
# Unresolved questions
[unresolved-questions]: #unresolved-questions


# Citations

* Plummer, Martyn. 2015. Cuts in Bayesian graphical models. *Statistical Computing* 25:37--43.

* Wikipedia. Gibbs Sampling.
  https://en.wikipedia.org/wiki/Gibbs_sampling

* Neal, Radford. 2011. MCMC using Hamiltonian dynamics. In *Handbook of
  Markov Chain Monte Carlo*. Chapman & Hall/CRC Press.

From b6bb74a9fe58fa78084af79524c8c6312e6b7ebc Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Mon, 4 Nov 2024 16:06:07 -0500
Subject: [PATCH 3/7] fix math typos

---
 designs/0035-pre-model-gqs.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index cfc1ea4..1ff00c2 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md
@@ -102,20 +102,20 @@
 low-level parameters of a hierarchical model, with the population
 parameters being sampled with block Gibbs. In a simple case, suppose
 we have a model with $N$ binary observations in $K$ groups with group
 indicator $\text{group}$. The likelihood will be $y_n \sim
-\text{bernoulli}(\textrm{logit}^{-1}(\alpha_{\text{group}_n}))$ and
+\text{bernoulli}(\textrm{logit}^{-1}(\alpha_{\text{group}[n]}))$ and
-the prior $\alpha_k \sim \textrm{normal}(mu, \sigma)$ with conjugate
+the prior $\alpha_k \sim \textrm{normal}(\mu, \sigma)$ with conjugate
 priors $\sigma^2 \sim \textrm{invGamma}(a, b)$ and $\mu \sim
 \textrm{normal}(0, \sigma)$.

 We can use conjugacy to define the posterior for the hyperpriors $\mu,
 \sigma^2$ analytically given $\alpha$; the update for $\sigma^2$
 integrates $\mu$ out of the joint density, giving

-$\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((alpha - \bar{\alpha})^2) / 2 + K \cdot \bar{alpha}^2 / (2 \cdot (K + 1))),$
+$\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((\alpha - \widebar{\alpha})^2) / 2 + K \cdot \widebar{\alpha}^2 / (2 \cdot (K + 1))),$

-where $\bar{alpha} = \text{mean}(\alpha)$. The conjgate posterior for
+where $\widebar{\alpha} = \text{mean}(\alpha)$. The conjgate posterior for
 $\mu$ given $\alpha$ and $\sigma^2$ is

-$\mu \sim \text{normal}(K \cdot \bar{\alpha} / (K + 1), \sigma /
+$\mu \sim \text{normal}(K \cdot \widebar{\alpha} / (K + 1), \sigma /
 \sqrt{K + 1}),$

@@ -194,7 +194,7 @@
 unduly distorting the kinetic model.

 The simplest example of cut would assume we have some calibration data
 for sensitivity $(N^\text{sens}, n^\text{sens})$ and for specificity
-$(N^\text{spec}, N^\test{spec})$, which we can model as binomial, e.g.,
+$(N^\text{spec}, n^\text{spec})$, which we can model as binomial, e.g.,
 $n^\text{sens} \sim \textrm{binomial}(N^\text{sens}, \textit{sens})$,
 where $\textit{sens} \in (0, 1)$ is the sensitivity parameter.
From 9a3a96973da4297cfcb3022bc1d9aac9de3dc841 Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Mon, 4 Nov 2024 16:07:29 -0500
Subject: [PATCH 4/7] LaTeX macro fix

---
 designs/0035-pre-model-gqs.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index 1ff00c2..c166a6b 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md
@@ -110,12 +110,12 @@
 We can use conjugacy to define the posterior for the hyperpriors $\mu,
 \sigma^2$ analytically given $\alpha$; the update for $\sigma^2$
 integrates $\mu$ out of the joint density, giving

-$\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((\alpha - \widebar{\alpha})^2) / 2 + K \cdot \widebar{\alpha}^2 / (2 \cdot (K + 1))),$
+$\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((\alpha - \overline{\alpha})^2) / 2 + K \cdot \overline{\alpha}^2 / (2 \cdot (K + 1))),$

-where $\widebar{\alpha} = \text{mean}(\alpha)$. The conjgate posterior for
+where $\overline{\alpha} = \text{mean}(\alpha)$. The conjgate posterior for
 $\mu$ given $\alpha$ and $\sigma^2$ is

-$\mu \sim \text{normal}(K \cdot \widebar{\alpha} / (K + 1), \sigma /
+$\mu \sim \text{normal}(K \cdot \overline{\alpha} / (K + 1), \sigma /
 \sqrt{K + 1}),$

From 4e013c49fa985cc142c8bf39ff95007b6a0fcbe9 Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Thu, 5 Dec 2024 16:03:37 -0500
Subject: [PATCH 5/7] design doc rc for latent data

---
 designs/0035-pre-model-gqs.md | 206 +++++++++++++++++++++++++++++-----
 1 file changed, 175 insertions(+), 31 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index c166a6b..66845ff 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md
@@ -1,14 +1,14 @@
-- Feature Name: Generated Data Block
+- Feature Name: Latent Data Block
 - Start Date: 09-30-2024

 # Summary
 [summary]: #summary

 This design doc proposes adding a new block to the Stan language
-called `generated data` with a range of proposed uses. The new block
+called `latent data` with a range of proposed uses. The new block
 will appear between the transformed parameters and model blocks and be
 executed once at the beginning of each iteration using `double` values
-for all parameters and transformed parameters. The `generated data`
+for all parameters and transformed parameters. The `latent data`
 variables will then be in scope for the rest of the program, but
 cannot be modified.

 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation

-Stan will support a new block, `generated data`, that is required to
+Stan will support a new block, `latent data`, that is required to
 appear between `transformed parameters` and the `model` block.
 Variables may be declared in the same way as for other blocks. All
 data types are allowed, including integers.

-Within the `generated data` block, all variables and functions
+Within the `latent data` block, all variables and functions
 declared in earlier blocks (`functions`, `data`, `transformed data`,
 `parameters`, and `transformed parameters`) will be available.
-Variables declared in the `generated data` block will be visible to
+Variables declared in the `latent data` block will be visible to
 all later blocks (`model` and `generated quantiries`).

-The `generated data` block allows access to any function not involving
+The `latent data` block allows access to any function not involving
 the target density, including random number generators. It disallows
 access to the target log density, so the following operations are not
 available: the target function `target()`, the target increment statement
 `target += ...;`, the distribution statement `A ~ foo(B);`, and
 functions ending in `_lp`.

   real<lower=0, upper=1> lambda;  // mixture ratio
 }
-generated data {
+latent data {
   // sample z ~ p(z | y, lambda, mu, sigma)
   array[N] int z;

-What the `generated data` block does for `z` does not look like the
-simple generative model. That's because the `generated data` block
+What the `latent data` block does for `z` does not look like the
+simple generative model. That's because the `latent data` block
 must code up the conditional distribution of $z$ given $y, \lambda,
 \mu,$ and $\sigma$. The conditional here requires the same derivation
 as marginalization.

 This is not an efficient way to code this model, but it's the simplest
-possible example of using the `generated data` block for Gibbs
+possible example of using the `latent data` block for Gibbs
 sampling of discrete parameters, such as $z$ in this example. More
 useful examples might include variable selection with a spike and slab
 prior.

 parameters {
   vector[K] alpha;
 }
-generated data {
+latent data {
   real alpha_bar = mean(alpha);

-The generated data block takes a posterior draw for `sigma_alpha` and
+The latent data block takes a posterior draw for `sigma_alpha` and
 `mu_alpha` conditioned on `alpha`, then HMC/NUTS is only used to
 update the low-level paramters `alpha` based on the likelihood and
 prior as defined in the model block.

-Here is a simple use case which uses the generated data block to
+Here is a simple use case which uses the latent data block to
 generate random sensitivity and specificity values from a population
 mean and covariance.

 parameters {
   real<lower=0, upper=1> prev;
 }
-generated data {
+latent data {
   vector[2] sens_spec
     = inv_logit(multi_normal_rng(loc_ss, Sigma_ss));
 }

-generated data {
+latent data {
   // cuts inference to sens and spec
   real sens = sens_cut;
   real spec = spec_cut;
 }

-generated data {
+latent data {
   vector[N] x1 = x1_obs, x2 = x2_obs;

 We can move to stochastic gradient descent by randomly subsampling `y`
-in the `generated data` block. As we do this, it's traditional to
+in the `latent data` block. As we do this, it's traditional to
 scale the likelihood to match the original data size.

 parameters {
   real<lower=0, upper=1> theta;
 }
-generated data {
+latent data {
   array[N_sub] int y_sub;
   for (n in 1:N_sub) {
     y_sub[n] = y[categorical_rng(unif)];
   }
 }

 If you plug this model into optimization, by default it will apply
 quasi-Newton steps using stochastic gradients.
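To see why `N_over_N_sub` is the right scaling factor, let $i_1,
\ldots, i_{N_\text{sub}}$ be the uniformly sampled indices. Then

$\mathbb{E}\!\left[ \frac{N}{N_\text{sub}} \sum_{j=1}^{N_\text{sub}} \log p(y_{i_j} \mid \theta) \right] = \frac{N}{N_\text{sub}} \cdot N_\text{sub} \cdot \frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta),$

so the scaled subsample log likelihood, and hence its gradient, is an
unbiased estimate of the full-data log likelihood.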
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## Stan Reference Manual

Stan's block structure and execution model is described in the
reference manual in the
[Program Blocks chapter](https://mc-stan.org/docs/reference-manual/blocks.html)
and the
[Program Execution chapter](https://mc-stan.org/docs/reference-manual/execution.html).
This feature will add a new section to the program blocks chapter,
which will read as follows.

### Program Blocks for latent data

The latent data block appears after the transformed parameters block
but before the model block. Like the parameters, transformed
parameters, and generated quantities blocks, the latent data will be
included in the output, and its top-level variables will be available
in all later blocks. Latent data will also have access to the data,
transformed data, parameters, and transformed parameters.

#### Iteration number

The iteration number can optionally be set in the latent data block
from the outside. There will be a reserved `iteration_number__`
variable. It will be set to 0 by default, but can be updated by an
external algorithm.

### Program Execution for latent data

For all of our inference algorithms (HMC, ADVI, Pathfinder,
optimization, Laplace), the latent data block should be executed once
at the start of each iteration. For example, in Hamiltonian Monte
Carlo, the latent data block is executed once based on the initial
parameter values, then the values are used without changing on each
leapfrog step. For optimization, each iteration of L-BFGS should
evaluate the latent data block once per iteration and leave it fixed
through the line search. The latent data needs to be executed before
the model block.

By default, the latent data block will be empty and there will be
nothing to compute. This should be possible to set up so that all of
the inference methods are backwards compatible.

As with blocks other than the parameters block, any constraints on the
declared variables will be checked at the end of the block and an
exception will be raised if they are violated, which will cause the
current iteration in any of the algorithms to be rejected.

## C++ model class

The `latent data` block will be executed with primitive (`double` and
`int` in C++) variables, just like the `generated quantities` block.
This effectively "cuts" the gradient information from propagating back
through the latent data block. Nevertheless, computation in the
latent data block can affect what happens in the model block by
defining new variables on which the model block may condition.

### Representing the latent data

The design is based on an object-based representation of the latent
data using code generation. For example, suppose the latent data
block is defined as follows.

```stan
latent data {
  real sens = beta_rng(98, 2);
  real spec = beta_rng(95, 5);
}
```

This will lead to the code generation of a class in the generated
`.hpp` file for the model.

```cpp
struct LatentData {
  int iteration_number__;  // included by default in all LatentData
  double sens;
  double spec;
};
```

### Generating the latent data

An additional method in the model class will be required to populate
the latent data.

```cpp
template <typename RNG>
inline void latent_data(const Eigen::VectorXd& params_r,
                        LatentData& latent_data__,
                        RNG& base_rng,
                        int iteration = -1,
                        std::ostream* pstream = nullptr) const {
  latent_data__.iteration_number__ = iteration;
  // ... code generated to populate latent_data__ ...
}
```

The calling context will be responsible for managing the memory of any
data structures in `LatentData`, such as `Eigen::Matrix` or
`std::vector` instances. By reusing the same `LatentData` variable
(per thread), variables can be reset rather than allocated and freed
on each evaluation.

### Modified log_prob method

The existing `log_prob` method must be updated to maintain backward
compatiblity:

```cpp
template <bool propto__, bool jacobian__, typename T__>
inline T__ log_prob(Eigen::Matrix<T__, Eigen::Dynamic, 1>& params_r,
                    const LatentData& latent_data__,
                    std::ostream* pstream = nullptr) const;
```

In this case, the `log_prob` code generation must be updated to copy
all of the generated data into scope.

```cpp
template <bool propto__, bool jacobian__, typename T__>
inline T__ log_prob(Eigen::Matrix<T__, Eigen::Dynamic, 1>& params_r,
                    const LatentData& latent_data__,
                    std::ostream* pstream = nullptr) const {
  double sens = latent_data__.sens;
  double spec = latent_data__.spec;
  // ... code generated as before, with sens/spec in scope ...
}
```

For ease of backward compatiblity, we can have the old implementation
with a default value,

```cpp
template <bool propto__, bool jacobian__, typename T__>
inline T__ log_prob(Eigen::Matrix<T__, Eigen::Dynamic, 1>& params_r,
                    std::ostream* pstream = nullptr) const {
  LatentData latent_data;
  return log_prob<propto__, jacobian__>(params_r, latent_data, pstream);
}
```

### BridgeStan

BridgeStan can be updated to mimic the updates to the Stan model
class.

# Drawbacks
[drawbacks]: #drawbacks

1. It will require additional documentation and tests, increasing
   Stan's long-term maintenance burden and the overhead required for
   someone to understand what it's doing.
2. Using "cut"-based inference does not perform full Bayesian
   inference, and there's a risk that this will confuse our users.
3. Using this block for Gibbs sampling will be less efficient than
   marginalization and may tempt many users to define poorly mixing
   models.
4. There is nothing enforcing consistency of the conditional
   distributions when doing Gibbs, so this will be easy to get wrong
   and there won't be an easy way to test that this is right.
5. Introducing a `latent data` block opens up a back door approach to
   converting real numbers to integers, which can be done with
   `to_int(data real)` if the argument is "data" (i.e., a primitive
   rather than autodiff variable).

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

## Alternatives

To support pure uncertainty propagation and non-interacting forms of
cut (i.e., ones that don't use any of the model parameters), we can
use multiple imputation in the traditional way by running multiple
Stan programs in sequence. This is less efficient and not as general.

There is no alternative for the Gibbs-based discrete parameter
inference.

## Impact of not implementing

Users will find it impossible to do discrete sampling that interacts
with the model and will find it very challenging to do missing data
imputation.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

None known.
# Citations

* Plummer, Martyn. 2015. Cuts in Bayesian graphical models. *Statistical Computing* 25:37--43.

* Wikipedia. Gibbs Sampling.
  https://en.wikipedia.org/wiki/Gibbs_sampling

* Neal, Radford. 2011. MCMC using Hamiltonian dynamics. In *Handbook of
  Markov Chain Monte Carlo*. Chapman & Hall/CRC Press.

From cd05dc0fc7d0ba327ab050d07ad3946f4a041856 Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Thu, 5 Dec 2024 16:09:20 -0500
Subject: [PATCH 6/7] fix typos

---
 designs/0035-pre-model-gqs.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index 66845ff..00710fb 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md
@@ -34,7 +34,7 @@
 Within the `latent data` block, all variables and functions
 declared in earlier blocks (`functions`, `data`, `transformed data`,
 `parameters`, and `transformed parameters`) will be available.
 Variables declared in the `latent data` block will be visible to
-all later blocks (`model` and `generated quantiries`).
+all later blocks (`model` and `generated quantities`).

 The `latent data` block allows access to any function not involving

 $\sigma^2 \sim \text{invGamma}(a + K/2, b + \text{sum}((\alpha - \overline{\alpha})^2) / 2 + K \cdot \overline{\alpha}^2 / (2 \cdot (K + 1))),$

-where $\overline{\alpha} = \text{mean}(\alpha)$. The conjgate posterior for
+where $\overline{\alpha} = \text{mean}(\alpha)$. The conjugate posterior for
 $\mu$ given $\alpha$ and $\sigma^2$ is

 The latent data block takes a posterior draw for `sigma_alpha` and
 `mu_alpha` conditioned on `alpha`, then HMC/NUTS is only used to
-update the low-level paramters `alpha` based on the likelihood and
+update the low-level parameters `alpha` based on the likelihood and
 prior as defined in the model block.

 The existing `log_prob` method must be updated to maintain backward
-compatiblity:
+compatibility:

-For ease of backward compatiblity, we can have the old implementation
+For ease of backward compatibility, we can have the old implementation
 with a default value,

From 3a3d90b55b119e92925b8e4f039b67ecacd82358 Mon Sep 17 00:00:00 2001
From: Bob Carpenter
Date: Fri, 6 Dec 2024 12:07:08 -0500
Subject: [PATCH 7/7] crtp latent_data base classes

---
 designs/0035-pre-model-gqs.md | 172 +++++++++++++++++++++++++++++-----
 1 file changed, 147 insertions(+), 25 deletions(-)

diff --git a/designs/0035-pre-model-gqs.md b/designs/0035-pre-model-gqs.md
index 00710fb..3370b49 100644
--- a/designs/0035-pre-model-gqs.md
+++ b/designs/0035-pre-model-gqs.md

### Review of model class design

Stan's transpiler, `stanc3`, converts Stan code to a C++ class
definition along with some helper functions. In order to provide a
workable interface to the superclass without invoking virtual function
calls, the model class follows the
[curiously recurring template pattern](https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern)
(CRTP). The code generated by the transpiler for the model `foo`
declares a class in the following way.

```cpp
class foo_model final : public model_base_crtp<foo_model> {
  // ... generated methods ...
};
```

The `model_base_crtp` class then provides CRTP-based implementations
of methods for I/O that depend on features of the model. It is defined
to extend the `model_base` class, which is used by the external
interfaces.

To deal with construction, the generated C++ code for a model also
includes a generic factory method to return a base model.

```cpp
stan::model::model_base&
new_model(stan::io::var_context& data_context, unsigned int seed,
          std::ostream* msg_stream);
```

This simply creates an instance of `foo_model` with the specified
arguments and returns it.

For reference, here are links to the full definitions from the
`stan-dev/stan` GitHub repository:

* [`model_base.hpp`](https://github.com/stan-dev/stan/blob/develop/src/stan/model/model_base.hpp)
* [`model_base_crtp.hpp`](https://github.com/stan-dev/stan/blob/develop/src/stan/model/model_base_crtp.hpp)

### Representing the latent data

The design is based on an object-based representation of the latent
data using code generation. For example, suppose the latent data
block is defined as follows.

```stan
latent data {
  real sens = beta_rng(98, 2);
  real spec = beta_rng(95, 5);
}
```

This will lead to the code generation of a class in the generated
`.hpp` file for the model following the CRTP.

```cpp
struct latent_data : public latent_data_base_crtp<latent_data> {
  int iteration_number__;  // included by default in all latent_data
  double sens;
  double spec;
};
```

Here, the `latent_data_base_crtp` class will extend
`latent_data_base`, following the pattern used for the model class.

### Constructing the latent data generically

Just like the model class, this will require a generic factory method,
though in this case, no arguments are required as it will be used like
a simple `struct`.

```cpp
latent_data_base& new_latent_data();
```

This must then get passed where necessary and cast to the relevant
type in the class definition, as shown in the next section.

### Generating the latent data

An additional method in the model class will be required to populate
the latent data.

```cpp
template <typename RNG>
inline void set_latent_data(const Eigen::VectorXd& params_r,
                            latent_data_base& ldb,
                            RNG& base_rng,
                            std::ostream* pstream = nullptr) const {
  latent_data& latent_data__ = static_cast<latent_data&>(ldb);
  // ... code generated to populate latent_data__ ...
}
```

Because it does not use anything specific to the model class, this can
be implemented in the `latent_data_base_crtp` class using the CRTP.

```cpp
template <typename LD>
class latent_data_base_crtp : public latent_data_base {
 public:
  template <typename RNG>
  inline void set_latent_data(const Eigen::VectorXd& params_r,
                              latent_data_base& ldb,
                              RNG& base_rng,
                              std::ostream* pstream = nullptr) const {
    static_cast<LD&>(ldb).set_latent_data(params_r, ldb, base_rng,
                                          pstream);
  }
};
```

The concrete `latent_data` class implemented along with the model
class will be responsible for holding the memory for any data needed
in latent data, such as `Eigen::Matrix` or `std::vector`
instances. By reusing the same `latent_data` variable (per thread),
variables can be reset rather than allocated and freed on each
evaluation.

### The `latent_data_base` class

The `latent_data_base` class will only know how to get and set
iteration numbers and how to write out variable names and values.

```cpp
class latent_data_base {
  int iteration_number_;

 public:
  int get_iteration_number() const {
    return iteration_number_;
  }
  void set_iteration_number(int n) {
    iteration_number_ = n;
  }
  virtual int num_variables() const = 0;
  virtual void write_variable_names(std::vector<std::string>& names) const = 0;
  virtual void write_variable_values(std::vector<double>& values) const = 0;
};
```

The CRTP class will implement the three virtual functions using the
CRTP in the usual way.

This uses inheritance from the base class, but the cost is relatively
low in that it is only done once per iteration. This follows the way
Stan's `model_base` class is designed.

### Generalized `log_prob` method

The existing `log_prob` method must be updated to maintain backward
compatibility:

```cpp
template <bool propto__, bool jacobian__, typename T__>
inline T__ log_prob(Eigen::Matrix<T__, Eigen::Dynamic, 1>& params_r,
                    const latent_data& latent_data__,
                    std::ostream* pstream = nullptr) const;
```

In this case, the `log_prob` code generation must be updated to copy
all of the generated data into scope.

```cpp
template <bool propto__, bool jacobian__, typename T__>
inline T__ log_prob(Eigen::Matrix<T__, Eigen::Dynamic, 1>& params_r,
                    const latent_data_base& ldb,
                    std::ostream* pstream = nullptr) const {
  const latent_data& latent_data__ = static_cast<const latent_data&>(ldb);
  double sens = latent_data__.sens;
  double spec = latent_data__.spec;
  // ... code generated as before, with sens/spec in scope ...
}
```

Note that it is implemented generically using `latent_data_base` for
use from outside interfaces.

### Generalized `write_array` method

The existing `write_array` method must be similarly updated, because
now it is going to need to write the generated data.

```cpp
template <typename RNG>
inline void write_array(RNG& base_rng,
                        Eigen::Matrix<double, Eigen::Dynamic, 1>& params_r,
                        Eigen::Matrix<double, Eigen::Dynamic, 1>& vars,
                        const bool emit_transformed_parameters = true,
                        const bool emit_generated_quantities = true,
                        std::ostream* pstream = nullptr,
                        const latent_data_base& ldb = dummy_latent_data()) const {
  const latent_data& latent_data__ = static_cast<const latent_data&>(ldb);
  // ...
}
```

As with the `log_prob` method, this one also takes a generic
`latent_data_base` as an argument to permit use from the outside.

By making it the last argument and giving it a dummy value (just a
no-data dummy implementation), it will remain backward compatible for
outside interfaces that have not yet been made aware of `latent_data`.

### BridgeStan

BridgeStan can be updated to mimic the updates to the Stan model
class.
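### Example driver loop

To make the intended execution order concrete, here is a minimal
sketch of how an algorithm driver might tie these pieces together.
The `run_iterations` function and its control flow are illustrative
assumptions, not part of the proposal; only `latent_data_base`,
`set_iteration_number`, `set_latent_data`, and the extended `log_prob`
come from the design above.

```cpp
// Sketch of a per-iteration driver; Model stands in for a generated
// model class such as foo_model, and the update step is elided.
template <class Model, class RNG>
void run_iterations(const Model& model, latent_data_base& ldb,
                    Eigen::VectorXd& params_r, RNG& base_rng,
                    int num_iterations) {
  for (int it = 0; it < num_iterations; ++it) {
    ldb.set_iteration_number(it);
    // latent data is refreshed exactly once per iteration ...
    model.set_latent_data(params_r, ldb, base_rng);
    // ... then held fixed for every density evaluation within the
    // iteration (all leapfrog steps or line-search points), e.g.,
    //   double lp = model.template log_prob<false, true>(params_r, ldb);
    // ... HMC / L-BFGS update of params_r goes here ...
  }
}
```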