---
title: "4. PPD (Prior Predictive Distributions) and PPC (Posterior Predictive Checks)"
output: html_notebook
---
## Introduction
Bayesian inference starts with a prior distribution and ends with a posterior distribution. This raises two important questions:
1. How do we know that our prior appropriately captures realistic prior information? In other words, how do we convince a skeptic that our prior is reasonable?
2. How do we know that the posterior gives us valid inference?
Both of these questions are deep and difficult, and we won't be able to answer either one completely or to everyone's satisfaction. However, we can at least try to address both.
To understand our prior, we can do two things:
1. We can graph it to get a sense of the probability it places on the parameter space.
2. We can generate the *prior predictive distribution* by simulating fake data from the range of possible parameter values, weighted by the probability of those values in the prior.
To understand our posterior, we can perform a *posterior predictive check*. This also involves simulating fake data, but this time from the parameters of the posterior distribution. The idea is that we can compare this fake data to the real data to make sure that the posterior is giving predictions that are consistent with the data set with which we started. Unless the prior is extremely strong and the data is scarce, the posterior should always be in reasonable agreement with the data.
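To make this concrete before setting anything up in Stan, here is a minimal plain-R sketch of a single prior predictive draw (illustrative only; the uniform prior and the 18 trials anticipate the example below):
```{r}
# One prior predictive draw: sample theta from the prior,
# then simulate a complete fake data set given that theta.
theta_draw <- runif(1)                             # draw from a Uniform(0, 1) prior
y_fake <- rbinom(18, size = 1, prob = theta_draw)  # 18 fake Bernoulli trials
y_fake
```
Repeating this many times, with a fresh `theta_draw` each time, builds up the prior predictive distribution.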
## Preliminaries
Load necessary packages:
```{r}
library(tidyverse)
library(rstan)
library(tidybayes)
library(bayesplot)
```
```{r}
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```
## Data
We'll continue to use the example of 12 successes in 18 trials from the `2_Continuous_Bayes.Rmd` notes.
```{r}
N1 <- 18 # Define the sample size
y1 <- c(rep(1, 12), rep(0, 6)) # 12 successes and 6 failures
y1
```
```{r}
stan_data <- list(N = N1, y = y1)
stan_data
```
## Uniform prior
We'll start by assuming a uniform prior.
### The prior distribution and the PPD (Prior Predictive Distribution)
We can sample from the prior and simulate data from the prior using a `generated quantities` block in the Stan model code. In this block, we declare any variables we need, similar to what we would have done in the `data` or `parameters` blocks. Then we generate a random draw from the prior by appending `_rng` (for "random number generator") after the name of the probability distribution. Finally, with a random value of the parameter chosen, we draw another complete data set (of size `N`).
The result of simulating a draw from the prior and then sampling data using that parameter value is called a "prior predictive distribution".
In the Stan code below, we pass in the data (because Stan requires a `data` block to run properly), but we won't use any of it except for referencing the sample size `N`. Compare this code to the Stan model code in the `3.First_Stan_examples.Rmd` notes. There is no `parameters` block and no `model` block. The two variables we need---`theta` and `y_sim`---have to be declared at the top of the `generated quantities` block. Then, instead of using the tilde (~) to assign a probability distribution, we use an equal sign (=) to assign a randomly generated number from a probability distribution.
```{stan, output.var = "bin_unif_prior", cache = TRUE}
data {
  int<lower = 0> N;
  array[N] int<lower = 0, upper = 1> y;
}
generated quantities {
  real theta;
  array[N] int y_sim;
  theta = uniform_rng(0, 1); // Draw randomly from the prior
  for (n in 1:N) {
    y_sim[n] = bernoulli_rng(theta); // Simulate new data
  }
}
```
The `sampling` function requires special arguments to tell it that we are not interested in using MCMC to sample parameter values. Because the Stan model above only generates random numbers, we need only one chain run with the "Fixed_param" algorithm. While we're at it, we'll also tell Stan not to print out all the information about processing the chains by using the argument `refresh = 0`.
```{r, cache = TRUE}
fit_bin_unif_prior <- sampling(bin_unif_prior,
                               data = stan_data,
                               chains = 1,
                               algorithm = "Fixed_param",
                               seed = 54321,
                               refresh = 0)
```
```{r}
fit_bin_unif_prior
```
Now we can look at the shape of the prior:
```{r}
mcmc_hist(fit_bin_unif_prior, pars = "theta")
```
Yep, that looks like a uniform distribution.
We can also look at the data generated by the prior. To do this, we'll need to pull out the `y_sim` columns and convert the resulting data frame to a matrix:
```{r}
y_sim_bin_unif_prior <- tidy_draws(fit_bin_unif_prior) %>%
  dplyr::select(starts_with("y_sim")) %>%
  as.matrix()
```
There are many ways to summarize a prior predictive distribution in general, but for this simple Bernoulli data (0s and 1s), we can just look at bar graphs that show the relative number of 0s and 1s. (The closest command we have in `bayesplot` is a histogram, but it will do the job.) We have 1000 draws and we can't look at that many. Let's look at the first 20:
```{r}
ppd_hist(ypred = y_sim_bin_unif_prior[1:20,])
```
We see a lot of variability, as we'd expect for a uniform prior.
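As a quick numeric check (a sketch using the matrix we just created), the success rate pooled across all prior predictive data sets should sit near 0.5, the mean of the Uniform(0, 1) prior:
```{r}
# Pooled success rate over all prior predictive data sets;
# under a Uniform(0, 1) prior on theta this should be close to 0.5.
mean(y_sim_bin_unif_prior)
```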
### PPC (Posterior Predictive Check)
The easiest way to perform a PPC is to include a `generated quantities` block in the Stan model code. Below, we reproduce the same model code from the `3.First_Stan_examples.Rmd` notes, but with a `generated quantities` block specifically for generating simulated data from the posterior.
```{stan, output.var = "bin_unif", cache = TRUE}
data {
  int<lower = 0> N;
  array[N] int<lower = 0, upper = 1> y;
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  theta ~ uniform(0, 1); // prior
  y ~ bernoulli(theta);  // likelihood
}
generated quantities {
  array[N] int y_sim;
  for (n in 1:N) {
    y_sim[n] = bernoulli_rng(theta);
  }
}
```
Now we sample from the model using our data in the usual way.
```{r, cache = TRUE}
fit_bin_unif <- sampling(bin_unif,
                         data = stan_data,
                         seed = 54321,
                         refresh = 0)
```
As a reminder, here is what the posterior distribution looks like:
```{r}
mcmc_dens(fit_bin_unif, pars = "theta")
```
The tools for the PPC are the same as those for the PPD; the only difference is that we've drawn from the posterior instead of the prior.
We will still extract the `y_sim` columns and turn them into a matrix:
```{r}
y_sim_bin_unif <- tidy_draws(fit_bin_unif) %>%
  dplyr::select(starts_with("y_sim")) %>%
  as.matrix()
```
The commands for the PPC are very similar to those for the PPD. We need to change `ppd_` to `ppc_`. Then, the argument is no longer called `ypred`. Instead, `ppc_` functions require two arguments: `y` and `yrep`. The `y` values are the original data (`y1`). The `yrep` values are the matrix of random draws created in the `generated quantities` block that we extracted just above.
The first bar graph below (dark blue) is the actual data, 6 failures and 12 successes. The remaining 19 bar graphs (light blue) are the first 19 rows of the simulated data values.
```{r}
ppc_hist(y = y1,
         yrep = y_sim_bin_unif[1:19,])
```
The aim of the PPC is to compare the simulated data to the actual data. Here we note that the majority of simulations shown above produced fake data that is somewhat similar to the real data. Most of the simulations show more successes than failures. There is a great deal of variability in the posterior distribution, so we won't expect every simulation to produce exactly the same proportion of failures and successes. Nevertheless, we want to check that our posterior isn't making predictions that are wildly different from the real data. The posterior distribution should always be reasonably consistent with the data.
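We can also summarize the comparison with a single test statistic. Below is a sketch in base R using the objects defined above; `bayesplot` also provides `ppc_stat` for a graphical version of the same idea.
```{r}
# Compare the observed success rate to the success rate of each
# simulated data set; a tail proportion far from 0 and 1 indicates
# that the posterior predictions are consistent with the data.
obs_rate <- mean(y1)                   # 12/18
sim_rates <- rowMeans(y_sim_bin_unif)  # one success rate per simulated data set
mean(sim_rates >= obs_rate)            # posterior predictive tail proportion
```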
Here is another way of looking at the simulated posterior data. The lighter bars show the failure and success counts in the actual data. The darker blue intervals are the middle 90% of all values obtained in the simulations. The center (median) dots are right on top of the data values, but in general, this need not be true. As long as the median dots are close to the top of the lighter bars, we know that the posterior predictions are close to the original data.
```{r}
ppc_bars(y = y1,
         yrep = y_sim_bin_unif)
```
## Normal prior
The next example in `2_Continuous_Bayes.Rmd` was a normal prior:
$$
\theta \sim N(0.3, 0.1).
$$
### The prior distribution and the PPD (Prior Predictive Distribution)
The prior predictive distribution should be different in this case. The code below is similar to the earlier example, but using a normal distribution instead of a uniform one. We do have to add a little extra logic to handle the case where `theta` ends up less than 0 or greater than 1, since such a value is not a valid probability. We need a function that sends values less than 0 to 0 and values greater than 1 to 1. Stan's `step` function, which returns 0 for negative arguments and 1 otherwise, does exactly that for out-of-range values of `theta`.
```{stan, output.var = "bin_norm1_prior", cache = TRUE}
data {
  int<lower = 0> N;
  array[N] int<lower = 0, upper = 1> y;
}
generated quantities {
  real theta;
  array[N] int y_sim;
  theta = normal_rng(0.3, 0.1); // Draw randomly from the prior
  for (n in 1:N) {
    if (theta < 0 || theta > 1)
      y_sim[n] = bernoulli_rng(step(theta)); // Out-of-range theta clamped to 0 or 1
    else
      y_sim[n] = bernoulli_rng(theta); // Simulate new data
  }
}
```
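As a plain-R analogue of that clamping logic (a sketch, not part of the model), the same mapping can be mimicked with `pmin` and `pmax`: draws below 0 go to 0, draws above 1 go to 1, and in-range draws pass through unchanged.
```{r}
# Mimic the step-based clamping in the Stan code above:
# out-of-range draws are pushed to the nearest valid probability.
theta_raw <- rnorm(5, mean = 0.3, sd = 0.1)
pmin(pmax(theta_raw, 0), 1)
```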
```{r, cache = TRUE}
fit_bin_norm1_prior <- sampling(bin_norm1_prior,
                                data = stan_data,
                                chains = 1,
                                algorithm = "Fixed_param",
                                seed = 54321,
                                refresh = 0)
```
```{r}
mcmc_hist(fit_bin_norm1_prior, pars = "theta")
```
Yep, that looks like a normal distribution.
Pull out the `y_sim` columns as before:
```{r}
y_sim_bin_norm1_prior <- tidy_draws(fit_bin_norm1_prior) %>%
  dplyr::select(starts_with("y_sim")) %>%
  as.matrix()
```
Looking at the first 20 draws:
```{r}
ppd_hist(ypred = y_sim_bin_norm1_prior[1:20,])
```
These are decidedly less "random" than with the uniform prior. The $N(0.3, 0.1)$ prior (mean 0.3, standard deviation 0.1) says that we expect roughly a 30% "success" rate with only a little deviation from that. The prior predictive distribution represented above does, indeed, show that most simulated samples have a larger bar over 0 than over 1.
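As a quick numeric check (a sketch using the matrix above), the success rate pooled across all prior predictive data sets should land near the prior mean of 0.3 (the rare clamped draws shift it only slightly):
```{r}
# Pooled success rate over all prior predictive data sets;
# under the N(0.3, 0.1) prior this should be close to 0.3.
mean(y_sim_bin_norm1_prior)
```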
### PPC (Posterior Predictive Check)
As before, we modify the Stan model by adding a `generated quantities` block for simulating new data given parameter values in the posterior.
```{stan, output.var = "bin_norm1", cache = TRUE}
data {
  int<lower = 0> N;
  array[N] int<lower = 0, upper = 1> y;
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  theta ~ normal(0.3, 0.1); // prior
  y ~ bernoulli(theta);     // likelihood
}
generated quantities {
  array[N] int y_sim;
  for (n in 1:N) {
    if (theta < 0 || theta > 1)
      y_sim[n] = bernoulli_rng(step(theta)); // Defensive: theta is constrained to [0, 1] here
    else
      y_sim[n] = bernoulli_rng(theta);
  }
}
```
```{r, cache = TRUE}
fit_bin_norm1 <- sampling(bin_norm1,
                          data = stan_data,
                          seed = 54321,
                          refresh = 0)
```
Here is the posterior distribution:
```{r}
mcmc_hist(fit_bin_norm1, pars = "theta")
```
Extract the simulated data sets:
```{r}
y_sim_bin_norm1 <- tidy_draws(fit_bin_norm1) %>%
  dplyr::select(starts_with("y_sim")) %>%
  as.matrix()
```
Plot the PPC:
```{r}
ppc_hist(y = y1,
         yrep = y_sim_bin_norm1[1:19,])
```
We can see in the upper-left corner that the original data had a success rate of 12/18, about 67%. Note that most (but not all) data sets simulated from the posterior seem to have a higher failure rate. It's no mystery why: the prior is quite strong and there isn't much data, so the posterior distribution still puts a lot more probability over the region $\theta < 0.5$. Therefore, most of the simulated data will have a higher number of failures.
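To quantify the disagreement, here is a sketch paralleling the earlier check: we ask how often a simulated success rate reaches the observed one.
```{r}
# Tail proportion for the normal-prior fit; a value near 0 or 1
# signals that the posterior predictions conflict with the data.
obs_rate <- mean(y1)
sim_rates_norm1 <- rowMeans(y_sim_bin_norm1)
mean(sim_rates_norm1 >= obs_rate)
```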
The intervals below show the same problem:
```{r}
ppc_bars(y = y1,
         yrep = y_sim_bin_norm1)
```
The simulated data is still much closer to what the prior expects than to the actual data.
In this case, we could graph the posterior distribution directly and see that it was closer to the prior than the likelihood. However, in more complex models, it may not be easy to visualize the posterior distribution directly. It will become increasingly important, then, that we perform a PPC to ensure that our posterior is reasonable. The PPC above shows us that our posterior *disagrees* with the data, which indicates a problem with our prior, or with our data, or both. In this case, we had a reasonably strong prior and not much data. The PPC alerts us to the fact that this might be an issue if we are trying to achieve reliable inference.