From 35ed46cb0d6426fe495bd2c7fa41be0e792110b2 Mon Sep 17 00:00:00 2001 From: Charles Margossian Date: Thu, 5 Dec 2024 18:00:14 -0500 Subject: [PATCH 1/2] update time series section on hmm. --- src/stan-users-guide/time-series.qmd | 374 ++++++--------------------- 1 file changed, 72 insertions(+), 302 deletions(-) diff --git a/src/stan-users-guide/time-series.qmd b/src/stan-users-guide/time-series.qmd index 61a52dff9..bcb22e01a 100644 --- a/src/stan-users-guide/time-series.qmd +++ b/src/stan-users-guide/time-series.qmd @@ -319,7 +319,7 @@ model { The error terms $\epsilon_t$ are defined as transformed parameters in terms of the observations and parameters. The definition of the -distribution statement (which also defines the likelihood) follows the +distribution statement (which also defines the likelihood) follows the definition, which can only be applied to $y_n$ for $n > Q$. In this example, the parameters are all given Cauchy (half-Cauchy for $\sigma$) priors, @@ -634,334 +634,104 @@ time of the original iterations. ## Hidden Markov models {#hmms.section} -A hidden Markov model (HMM) generates a sequence of $T$ output -variables $y_t$ conditioned on a parallel sequence of latent -categorical state variables $z_t \in \{1,\ldots, K\}$. These -"hidden" state variables are assumed to form a Markov chain so that -$z_t$ is conditionally independent of other variables given $z_{t-1}$. -This Markov chain is parameterized by a transition matrix $\theta$ -where $\theta_k$ is a $K$-simplex for $k \in \{ 1, \dotsc, K \}$. The -probability of transitioning to state $z_t$ from state $z_{t-1}$ is +A Hidden Markov model is a probabilistic +model over $N$ observtions $y_{1:N}$ and $N$ hidden states $x_{1:N}$, +defined by the conditional distributions $p(y_n \mid x_n, \phi)$ and +$p(x_n \mid x_{n-1}, \phi)$. Here we make the dependency on additional +model parameters $\phi$ explicit. +When $x_{1:N}$ is continuous, the user can explicitly encode these distributions +in Stan and use Markov chain Monte Carlo to integrate $x$ out. + +When each state $x$ takes a value over a discrete and finite set, +say $\{1, 2, ..., K\}$, we can use Stan's suite of HMM functions to marginalize +out $x_{1:N}$ and compute $p(y_{1:N} \mid \phi)$. +We start by defining the conditional observation distribution, stored in a +$K \times N$ matrix $\mega$ with $$ -z_t \sim \textsf{categorical}(\theta_{z[t-1]}). +\omega_{kn} = p(y_n \mid x_n = k, \phi). $$ -The output $y_t$ at time $t$ is generated conditionally independently -based on the latent state $z_t$. - -This section describes HMMs with a simple categorical model for -outputs $y_t \in \{ 1, \dotsc, V \}$. The categorical distribution for -latent state $k$ is parameterized by a $V$-simplex $\phi_k$. The -observed output $y_t$ at time $t$ is generated based on the hidden -state indicator $z_t$ at time $t$, +Next, we introduce the $K \times K$ transition matrix, $\Gamma$, with $$ -y_t \sim \textsf{categorical}(\phi_{z[t]}). +\Gamma_{ij} = p(x_n = h \mid x_{n - 1} = i, \phi). +$$ +Finally, we define the initial state K_vectir $\rho$, with +$$ + \rho_k = p(x_0 = k \mid \phi). $$ -In short, HMMs form a discrete mixture model where the mixture -component indicators form a latent Markov chain. - - - -### Supervised parameter estimation {-} -In the situation where the hidden states are known, the following -naive model can be used to fit the parameters $\theta$ and $\phi$. +As an example, consider a three-state model, with $K = 1, 2, 3$. +We assume that all transitions are possible, except from 1 to 3, +and 3 to 1, thereby setting two elements of the transition matrix +$\Gamma$ to 0. +The observations are normally distributed with +$$ + y \sim \text{normal}(\mu_k, \sigma), +$$ +where $\mu = (1, 5, 9)$. +The model is then ```stan data { - int K; // num categories - int V; // num words - int T; // num instances - array[T] int w; // words - array[T] int z; // categories - vector[K] alpha; // transit prior - vector[V] beta; // emit prior -} -parameters { - array[K] simplex[K] theta; // transit probs - array[K] simplex[V] phi; // emit probs -} -model { - for (k in 1:K) { - theta[k] ~ dirichlet(alpha); - } - for (k in 1:K) { - phi[k] ~ dirichlet(beta); - } - for (t in 1:T) { - w[t] ~ categorical(phi[z[t]]); - } - for (t in 2:T) { - z[t] ~ categorical(theta[z[t - 1]]); - } + int N; // Number of observations + array[N] real y; } -``` - -Explicit Dirichlet priors have been provided for $\theta_k$ and -$\phi_k$; dropping these two statements would implicitly take the -prior to be uniform over all valid simplexes. - -### Start-state and end-state probabilities {-} - -Although workable, the above description of HMMs is incomplete because -the start state $z_1$ is not modeled (the index runs from 2 to $T$). -If the data are conceived as a subsequence of a long-running process, -the probability of $z_1$ should be set to the stationary state -probabilities in the Markov chain. In this case, there is no distinct -end to the data, so there is no need to model the probability that the -sequence ends at $z_T$. -An alternative conception of HMMs is as models of finite-length -sequences. For example, human language sentences have distinct -starting distributions (usually a capital letter) and ending -distributions (usually some kind of punctuation). The simplest way to -model the sequence boundaries is to add a new latent state $K+1$, -generate the first state from a categorical distribution with -parameter vector $\theta_{K+1}$, and restrict the transitions so that -a transition to state $K+1$ is forced to occur at the end of the -sentence and is prohibited elsewhere. - -### Calculating sufficient statistics {-} - -The naive HMM estimation model presented above can be sped up -dramatically by replacing the loops over categorical distributions -with a single multinomial distribution. - -The data are declared as before. The transformed data block -computes the sufficient statistics for estimating the transition and -emission matrices. - -```stan -transformed data { - array[K, K] int trans; - array[K, V] int emit; - for (k1 in 1:K) { - for (k2 in 1:K) { - trans[k1, k2] = 0; - } - } - for (t in 2:T) { - trans[z[t - 1], z[t]] += 1; - } - for (k in 1:K) { - for (v in 1:V) { - emit[k, v] = 0; - } - } - for (t in 1:T) { - emit[z[t], w[t]] += 1; - } -} -``` +parameters { + // Rows of the transition matrix + simplex[2] t1; + simplex[3] t2; + simplex[2] t3; -The data model component based on looping over the input -is replaced with multinomials as follows. + // Initial state + simplex[3] rho; -```stan -model { - // ... - for (k in 1:K) { - trans[k] ~ multinomial(theta[k]); - } - for (k in 1:K) { - emit[k] ~ multinomial(phi[k]); - } + // Parameters of measurement model + vector[3] mu; + real sigma; } -``` -In a continuous HMM with normal emission probabilities could be sped -up in the same way by computing sufficient statistics. - -### Analytic posterior {-} - -With the Dirichlet-multinomial HMM, the posterior can be computed analytically because the Dirichlet is the conjugate prior to the multinomial. The following example illustrates how a Stan model can define the posterior analytically. This is possible in the Stan language because the model only needs to define the conditional probability of the parameters given the data up to a proportion, which can be done by defining the (unnormalized) joint probability or the (unnormalized) conditional posterior, or anything in between. +transformed parameters { + matrix[3, 3] gamma = rep_matrix(0.0, 3, 3); + matrix[3, N] log_omega; -The model has the same data and parameters as the previous models, but -now computes the posterior Dirichlet parameters in the transformed -data block. + // Build the transition matrix + gamma[1, 1:2] = t1; + gamma[2, ] = t2; + gamma[3, 2:3] t3; -```stan -transformed data { - vector[K] alpha_post[K]; - vector[V] beta_post[K]; - for (k in 1:K) { - alpha_post[k] = alpha; - } - for (t in 2:T) { - alpha_post[z[t - 1], z[t]] += 1; - } - for (k in 1:K) { - beta_post[k] = beta; - } - for (t in 1:T) { - beta_post[z[t], w[t]] += 1; + // Compute the log likelihoods in each possible state + for (n in 1:N) { + log_omega[1, n] = normal_lpdf(y[n] | mu[1], sigma); + log_omega[2, n] = normal_lpdf(y[n] | mu[2], sigma); + log_omega[3, n] = normal_lpdf(y[n] | mu[3], sigma); } } -``` - -The posterior can now be written analytically as follows. -```stan model { - for (k in 1:K) { - theta[k] ~ dirichlet(alpha_post[k]); - } - for (k in 1:K) { - phi[k] ~ dirichlet(beta_post[k]); - } -} -``` - + // prior + mu ~ normal(0, 1); + sigma ~ normal(0, 1); -### Semisupervised estimation {-} - -HMMs can be estimated in a fully unsupervised fashion without any data -for which latent states are known. The resulting posteriors are -typically extremely multimodal. An intermediate solution is to use -semisupervised estimation, which is based on a combination of -supervised and unsupervised data. Implementing this estimation -strategy in Stan requires calculating the probability of an output -sequence with an unknown state sequence. This is a marginalization -problem, and for HMMs, it is computed with the so-called forward -algorithm. - -In Stan, the forward algorithm is coded as follows. First, two additional data variable are declared for the unsupervised data. - -```stan -data { - // ... - int T_unsup; // num unsupervised items - array[T_unsup] int u; // unsup words - // ... + // Increment target by p(y | mu, sigma, Gamma, rho) + target += hmm_marginal(log_omega, Gamma, rho); } ``` -The model for the supervised data does not change; the unsupervised -data are handled with the following Stan implementation of the forward -algorithm. - -```stan -model { - // ... - array[K] real acc; - array[T_unsup, K] real gamma; - for (k in 1:K) { - gamma[1, k] = log(phi[k, u[1]]); - } - for (t in 2:T_unsup) { - for (k in 1:K) { - for (j in 1:K) { - acc[j] = gamma[t - 1, j] + log(theta[j, k]) - + log(phi[k, u[t]]); - } - gamma[t, k] = log_sum_exp(acc); - } - } - target += log_sum_exp(gamma[T_unsup]); -} -``` - -The forward values `gamma[t, k]` are defined to be the log -marginal probability of the inputs `u[1],...,u[t]` up to time -`t` and the latent state being equal to `k` at time -`t`; the previous latent states are marginalized out. The first -row of `gamma` is initialized by setting `gamma[1, k]` equal -to the log probability of latent state `k` generating the first -output `u[1]`; as before, the probability of the first latent -state is not itself modeled. For each subsequent time `t` and -output `j`, the value `acc[j]` is set to the probability of -the latent state at time `t-1` being `j`, plus the log -transition probability from state `j` at time `t-1` to state -`k` at time `t`, plus the log probability of the output -`u[t]` being generated by state `k`. The -`log_sum_exp` operation just multiplies the probabilities for -each prior state `j` on the log scale in an arithmetically stable -way. - -The brackets provide the scope for the local variables `acc` and -`gamma`; these could have been declared earlier, but it is -clearer to keep their declaration near their use. - - -### Predictive inference {-} - -Given the transition and emission parameters, $\theta_{k, k'}$ and -$\phi_{k,v}$ and an observation sequence $u_1, \dotsc, u_T \in \{ -1, \dotsc, V \}$, the Viterbi (dynamic programming) algorithm -computes the state sequence which is most likely to have generated the -observed output $u$. - -The Viterbi algorithm can be coded in Stan in the generated quantities -block as follows. The predictions here is the most likely state -sequence `y_star[1], ..., y_star[T_unsup]` underlying the -array of observations `u[1], ..., u[T_unsup]`. Because this -sequence is determined from the transition probabilities -`theta` and emission probabilities `phi`, it may be -different from sample to sample in the posterior. +The last function `hmm_marginal` takes in all the ingredients of the HMM, +and computes the relevant marginal distribution. +Finally, we can recover samples from the posterior distribution of the +hidden states in generated quantities. +It is also possible to compute the posterior probbability of each hidden state. ```stan generated quantities { - array[T_unsup] int y_star; - real log_p_y_star; - { - array[T_unsup, K] int back_ptr; - array[T_unsup, K] real best_logp; - real best_total_logp; - for (k in 1:K) { - best_logp[1, k] = log(phi[k, u[1]]); - } - for (t in 2:T_unsup) { - for (k in 1:K) { - best_logp[t, k] = negative_infinity(); - for (j in 1:K) { - real logp; - logp = best_logp[t - 1, j] - + log(theta[j, k]) + log(phi[k, u[t]]); - if (logp > best_logp[t, k]) { - back_ptr[t, k] = j; - best_logp[t, k] = logp; - } - } - } - } - log_p_y_star = max(best_logp[T_unsup]); - for (k in 1:K) { - if (best_logp[T_unsup, k] == log_p_y_star) { - y_star[T_unsup] = k; - } - } - for (t in 1:(T_unsup - 1)) { - y_star[T_unsup - t] = back_ptr[T_unsup - t + 1, - y_star[T_unsup - t + 1]]; - } - } + array[N] int latent_states = hmm_latent_rng(log_omega, Gamma, rho); + matrix[3, N] hidden_probs = hmm_hidden_state_prob(log_omega, Gamma, rho); } ``` +Note that the probabilities from `hmm_hidden_state_prob` are the marginal +probabilities of each state, and cannot be used to sample jointly all the +hidden states. Instead, users should rely on `hmm_latent_rng`. -The bracketed block is used to make the three variables -`back_ptr`, `best_logp`, and `best_total_logp` -local so they will not be output. The variable `y_star` will -hold the label sequence with the highest probability given the input -sequence `u`. Unlike the forward algorithm, where the -intermediate quantities were total probability, here they consist of -the maximum probability `best_logp[t, k]` for the sequence up to -time `t` with final output category `k` for time `t`, -along with a backpointer to the source of the link. Following the -backpointers from the best final log probability for the final time -`t` yields the optimal state sequence. - -This inference can be run for the same unsupervised outputs `u` -as are used to fit the semisupervised model. The above code can be -found in the same model file as the unsupervised fit. This is the -Bayesian approach to inference, where the data being reasoned about is -used in a semisupervised way to train the model. It is not -"cheating" because the underlying states for `u` are never -observed --- they are just estimated along with all of the other -parameters. - -If the outputs `u` are not used for semisupervised estimation but -simply as the basis for prediction, the result is equivalent to what -is represented in the BUGS modeling language via the cut operation. -That is, the model is fit independently of `u`, then those -parameters used to find the most likely state to have generated -`u`. - +For more details, since the corresponding case study: +https://mc-stan.org/users/documentation/case-studies/hmm-example.html. From 38c48dc6ae2f41cc326e07f7d4d2a9158dbb53d4 Mon Sep 17 00:00:00 2001 From: Charles Margossian Date: Fri, 6 Dec 2024 15:46:30 -0500 Subject: [PATCH 2/2] correct typos in hmm documents. --- src/stan-users-guide/time-series.qmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/stan-users-guide/time-series.qmd b/src/stan-users-guide/time-series.qmd index bcb22e01a..4e4850b0e 100644 --- a/src/stan-users-guide/time-series.qmd +++ b/src/stan-users-guide/time-series.qmd @@ -646,7 +646,7 @@ When each state $x$ takes a value over a discrete and finite set, say $\{1, 2, ..., K\}$, we can use Stan's suite of HMM functions to marginalize out $x_{1:N}$ and compute $p(y_{1:N} \mid \phi)$. We start by defining the conditional observation distribution, stored in a -$K \times N$ matrix $\mega$ with +$K \times N$ matrix $\omega$ with $$ \omega_{kn} = p(y_n \mid x_n = k, \phi). $$ @@ -654,7 +654,7 @@ Next, we introduce the $K \times K$ transition matrix, $\Gamma$, with $$ \Gamma_{ij} = p(x_n = h \mid x_{n - 1} = i, \phi). $$ -Finally, we define the initial state K_vectir $\rho$, with +Finally, we define the initial state $K$-vector $\rho$, with $$ \rho_k = p(x_0 = k \mid \phi). $$ @@ -712,13 +712,13 @@ model { mu ~ normal(0, 1); sigma ~ normal(0, 1); - // Increment target by p(y | mu, sigma, Gamma, rho) + // Increment target by log p(y | mu, sigma, Gamma, rho) target += hmm_marginal(log_omega, Gamma, rho); } ``` The last function `hmm_marginal` takes in all the ingredients of the HMM, -and computes the relevant marginal distribution. +and computes the relevant log marginal distribution. Finally, we can recover samples from the posterior distribution of the hidden states in generated quantities. It is also possible to compute the posterior probbability of each hidden state. @@ -734,4 +734,4 @@ probabilities of each state, and cannot be used to sample jointly all the hidden states. Instead, users should rely on `hmm_latent_rng`. For more details, since the corresponding case study: -https://mc-stan.org/users/documentation/case-studies/hmm-example.html. +[https://mc-stan.org/users/documentation/case-studies/hmm-example.html](https://mc-stan.org/users/documentation/case-studies/hmm-example.html).