From d44fd0939edb468c041c0b001ff65dc20df36a72 Mon Sep 17 00:00:00 2001 From: Simeon Carstens Date: Fri, 25 Oct 2019 12:04:17 +0200 Subject: [PATCH] Updated notebook with @aspiwack's and @robert.buckley's suggestions --- mcmc-intro/mcmc_introduction.ipynb | 93 +++++++++++++++++------------- 1 file changed, 52 insertions(+), 41 deletions(-) diff --git a/mcmc-intro/mcmc_introduction.ipynb b/mcmc-intro/mcmc_introduction.ipynb index 0c223f8..9f11421 100644 --- a/mcmc-intro/mcmc_introduction.ipynb +++ b/mcmc-intro/mcmc_introduction.ipynb @@ -4,27 +4,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Markov Chain Monte Carlo (MCMC) sampling, part I: the basics" + "# Markov chain Monte Carlo (MCMC) sampling, part 1: the basics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Markov Chain Monte Carlo (MCMC) is a powerful class of methods to sample from probability distributions known only up to an (unknown) normalization constant.\n", - "But before we dive into MCMC, let's consider why you might want to do sampling in the first place. \n", - "The answer to that is: whenever you need to fully characterize a distribution because you're either interested in the samples themselves or because you need them to approximate expectations of functions w.r.t. to a probability distribution.\n", - "Sometimes, only the mode of a probability distribution is of primary interest. In this case, it can be obtained by numerical optimization and full sampling is not necessary. \n", - "It turns out that sampling from any but the most basic probability distributions is a difficult task, because for those, you usually don't know the normalization constant.\n", - "You then can't easily get random numbers distributed according to this distribution: [inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) requires proper, normalized distributions.\n", - "Now in principle you could just obtain the normalization constant by numerical integration, but this quickly gets infeasible with increasing number of dimensions.\n", - "[Rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling) does not require normalized distributions, but implementing it efficiently can be very hard and requires a good deal of knowledge about the distribution of interest.\n", - "That's when you need a smart way to obtain representative samples from your distribution which doesn't require knowledge of the normalization constant. \n", - "MCMC algorithms are a class of methods which do exactly that.\n", - "These methods date back to a seminal paper by Metropolis et al., who developed the first MCMC algorithm, correspondingly called [Metropolis algorithm](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm), to calculate the equation of state of a two-dimensional system of hard spheres, but really were looking for a general method to calculate expectation values occurring in statistical physics. \n", + "[Markov chain Monte Carlo (MCMC)](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) is a powerful class of methods to sample from probability distributions known only up to an (unknown) normalization constant. 
But before we dive into MCMC, let's consider why you might want to do sampling in the first place.\n", "\n", - "In this series of blog posts, I will introduce several important, increasingly complex MCMC algorithms.\n", - "Along the way, I hope you will acquire a solid understanding of challenges you may face when sampling from non-standard probability distributions and how to address them with the methods presented here. " + "The answer to that is: whenever you're either interested in the samples themselves (for example, inferring unknown parameters in Bayesian inference) or you need them to approximate expected values of functions w.r.t. a probability distribution (for example, calculating thermodynamic quantities from the distribution of microstates in statistical physics). Sometimes, only the mode of a probability distribution is of primary interest. In this case, it can be obtained by numerical optimization, so full sampling is not necessary.\n", + "\n", + "It turns out that sampling from any but the most basic probability distributions is a difficult task. [Inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) is an elementary method to sample from probability distributions, but it requires the cumulative distribution function, which in turn requires knowledge of the generally unknown normalization constant. Now in principle, you could just obtain the normalization constant by numerical integration, but this quickly gets infeasible with an increasing number of dimensions. [Rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling) does not require a normalized distribution, but efficiently implementing it requires a good deal of knowledge about the distribution of interest, and it suffers strongly from the curse of dimensionality, meaning that its efficiency decreases rapidly with an increasing number of variables. That's when you need a smart way to obtain representative samples from your distribution which doesn't require knowledge of the normalization constant.\n", + "\n", + "MCMC algorithms are a class of methods which do exactly that. These methods date back to a [seminal paper by Metropolis et al.](https://pdfs.semanticscholar.org/7b3d/c9438227f747e770a6fb6d7d7c01d98725d6.pdf), who developed the first MCMC algorithm, correspondingly called the [Metropolis algorithm](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm), to calculate the equation of state of a two-dimensional system of hard spheres. In reality, they were looking for a general method to calculate expected values occurring in statistical physics.\n", + "\n", + "In this blog post, I introduce the basics of MCMC sampling; in subsequent posts I'll cover several important, increasingly complex and powerful MCMC algorithms, which all address different difficulties one frequently faces when using the Metropolis-Hastings algorithm. Along the way, you will gain a solid understanding of these challenges and how to address them. Also, this serves as a reference for MCMC methods in the context of the [monad-bayes](https://www.tweag.io/posts/2019-09-20-monad-bayes-1.html) series. Furthermore, I hope the provided notebooks will not only spark your interest in exploring the behavior of MCMC algorithms for various parameters/probability distributions, but also serve as a basis for implementing and understanding useful extensions of the basic versions of the algorithms I present."
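To make concrete why the normalization constant matters, here is a small illustrative sketch (an aside, not part of the original notebook) of inverse transform sampling for an exponential distribution, one of the rare cases where the normalized cumulative distribution function is available and invertible in closed form; the function name `sample_exponential` and the chosen rate are my own:

```python
import numpy as np

# The exponential distribution with rate lam is properly normalized and its CDF
# F(x) = 1 - exp(-lam * x) can be inverted analytically. This is exactly what is
# missing for the unnormalized distributions that MCMC is designed for.
def sample_exponential(lam, n_samples):
    u = np.random.uniform(size=n_samples)  # uniform random numbers in [0, 1)
    return -np.log(1.0 - u) / lam          # apply the inverse CDF to u

samples = sample_exponential(2.0, 100000)
print(samples.mean())  # close to the true mean 1 / lam = 0.5
```

For a distribution known only up to a constant, this recipe is unavailable, which is precisely the gap MCMC fills.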
] }, { @@ -40,10 +35,11 @@ "source": [ "Now that we know why we want to sample, let's get to the heart of MCMC — Markov chains.\n", "What is a Markov chain?\n", - "Really informally and without all the technical details, a Markov chain is a random sequence of states in some state space in which the probability of picking a certain state next depends only on the current state in the chain and not on the previous history: it is memory-less.\n", + "Without all the technical details, a Markov chain is a random sequence of states in some state space in which the probability of picking a certain state next depends only on the current state in the chain and not on the previous history: it is memory-less.\n", "Under certain conditions, a Markov chain has a unique stationary distribution of states to which it will converge after a certain number of states.\n", "From that number on, states in the Markov chain will be distributed according to the invariant distribution.\n", - "MCMC algorithms work by constructing a Markov chain with the probability distribution you want to sample from as the stationary distribution." + "MCMC algorithms work by constructing a Markov chain with the probability distribution you want to sample from as the stationary distribution.\n", + "In order to sample from a distribution $\pi(x)$, an MCMC algorithm constructs and simulates a Markov chain whose stationary distribution is $\pi(x)$, meaning that, after an initial \"burn-in\" phase, the states of that Markov chain are distributed according to $\pi(x)$. We thus just have to store the states to obtain samples from $\pi(x)$." ] }, { @@ -58,8 +54,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now just for fun (and for illustration), let's quickly whip up a Markov chain which has a unique stationary distribution.\n", - "We'll start with some imports and settings for the plots:" + "Now just for fun (and for illustration), let's quickly whip up a Markov chain which has a unique stationary distribution. We'll start with some imports and settings for the plots:" ] }, { @@ -117,7 +112,7 @@ "metadata": {}, "source": [ "The rows indicate the states the chain might currently be in and the columns the states the chains might transition to.\n", - "If we take one \"time\" step of the Markov chain as one hour, then, if it's sunny, there's a 60% chance it stays sunny in the next hour, a 30% chance that in the next hour we will have cloudy weather and only a 10% chance of it raining immediately after it had been sunny before.\n", + "If we take one \"time\" step of the Markov chain as one hour, then, if it's sunny, there's a 60% chance it stays sunny in the next hour, a 30% chance that in the next hour we will have cloudy weather and only a 10% chance of rain immediately after it had been sunny before.\n", "This also means that each row has to sum up to one." ] }, @@ -195,20 +190,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So all that is fun, but back to sampling an arbitrary probability distribution $\pi$. \n", - "It could either be discrete, in which case we would keep talking about a transition matrix $T$, or it could be continous, so that $T$ would be a transition *kernel*.\n", + "So that's lots of fun, but back to sampling an arbitrary probability distribution $\pi$. 
\n", + "It could either be discrete, in which case we would keep talking about a transition matrix $T$, or it could be continuous, in which case $T$ would be a transition *kernel*.\n", "From now on, we're considering continuous distributions, but all concepts presented here transfer to the discrete case. \n", "If we could design the transition kernel in such a way that the next state is already drawn from $\pi$, we would be done, as our Markov chain would... well... immediately sample from $\pi$.\n", "Unfortunately, to do this, we need to be able to sample from $\pi$, which we can't.\n", "Otherwise you wouldn't be reading this, right? \n", - "A way around this is to split up the transition kernel $T(x_{i+1}|x_i)$ into two parts:\n", - "a proposal distribution $q(x_{i+1}|x_i)$, from which we can sample possible next states, and an acceptance probability $p_\mathrm{acc}(x_{i+1}|x_i)$ which determines if a proposal $x_{i+1}$ is accepted as the next state in the chain or not.\n", - "This acceptance / rejection step corrects for the error introduced by proposal states from $q \neq \pi$.\n", + "A way around this is to split the transition kernel $T(x_{i+1}|x_i)$ into two parts:\n", + "a proposal step and an acceptance/rejection step. The proposal step features a proposal distribution $q(x_{i+1}|x_i)$, from which we can sample possible next states of the chain. In addition to being able to sample from it, we can choose this distribution arbitrarily, but one should strive to design it such that samples from it are as little correlated with the current state as possible and have a good chance of being accepted in the acceptance step. Said acceptance/rejection step is the second part of the transition kernel and corrects for the error introduced by proposal states drawn from $q \neq \pi$. It involves calculating an acceptance probability $p_\mathrm{acc}(x_{i+1}|x_i)$ and accepting the proposal $x_{i+1}$ with that probability as the next state in the chain. Drawing the next state $x_{i+1}$ from $T(x_{i+1}|x_i)$ is then done as follows: first, a proposal state $x_{i+1}$ is drawn from $q(x_{i+1}|x_i)$.
It is then accepted as the next state with probability \n", + "$p_\\mathrm{acc}(x_{i+1}|x_i)$ or rejected with probability $1 - p_\\mathrm{acc}(x_{i+1}|x_i)$, in which case the current state is copied as the next state.\n", + "\n", "We thus have \n", "$$\n", "T(x_{i+1}|x_i)=q(x_{i+1} | x_i) \\times p_\\mathrm{acc}(x_{i+1}|x_i) \\ \\mbox .\n", "$$\n", - "A sufficient condition for a Markov chain to have $\\pi$ as its stationary distribution is the transition kernel obeying *detailed balance* or, more intuitively, *microscopic reversibility*:\n", + "A sufficient condition for a Markov chain to have $\\pi$ as its stationary distribution is the transition kernel obeying *detailed balance* or, in the physics literature, *microscopic reversibility*:\n", "$$\n", "\\pi(x_i) T(x_{i+1}|x_i) = \\pi(x_{i+1}) T(x_i|x_{i+1})\n", "$$\n", @@ -219,26 +215,31 @@ "$$\n", "p_\\mathrm{acc}(x_{i+1}|x_i) = \\mathrm{min} \\left\\{1, \\frac{\\pi(x_{i+1}) \\times q(x_i|x_{i+1})}{\\pi(x_i) \\times q(x_{i+1}|x_i)} \\right\\} \\ \\mbox .\n", "$$\n", - "Often, symmetric proposal distributions with $q(x_i|x_{i+1})=q(x_{i+1}|x_i)$ are used, in which case the Metropolis-Hastings algorithm reduces to the original, but less general Metropolis algorithm developed in 1953 and for which\n", + "Now here's where the magic happens: we know $\\pi$ only up to a constant, but it doesn't matter, because that unknown constant cancels out in the expression for $p_\\mathrm{acc}$! It is this property of $p_\\mathrm{acc}$ which makes algorithms based on Metropolis-Hastings work for unnormalized distributions. Often, symmetric proposal distributions with $q(x_i|x_{i+1})=q(x_{i+1}|x_i)$ are used, in which case the Metropolis-Hastings algorithm reduces to the original, but less general Metropolis algorithm developed in 1953 and for which\n", "$$\n", "p_\\mathrm{acc}(x_{i+1}|x_i) = \\mathrm{min} \\left\\{1, \\frac{\\pi(x_{i+1})}{\\pi(x_i)} \\right\\} \\ \\mbox .\n", "$$\n", - "Formally, we can then write the complete Metropolis-Hastings transition kernel as\n", + "We can then write the complete Metropolis-Hastings transition kernel as\n", "$$\n", "T(x_{i+1}|x_i) = \\begin{cases}\n", " q(x_{i+1}|x_i) \\times p_\\mathrm{acc}(x_{i+1}|x_i) &: x_{i+1} \\neq x_i \\mbox ; \\\\\n", " 1 - \\int \\mathrm{d}x_{i+1} \\ q(x_{i+1}|x_i) \\times p_\\mathrm{acc}(x_{i+1}|x_i) &: x_{i+1} = x_i\\mbox .\n", " \\end{cases} \n", - "$$\n", - "A step is then performed as follows: a proposal state $x_{i+1}$ is drawn from $q(x_{i+1}|x_i)$.\n", - "It is then accepted as the next state with probability $p_\\mathrm{acc}(x_{i+1}|x_i)$ or rejected with probability $1-p_\\mathrm{acc}(x_{i+1}|x_i)$, in which case the current state is copied as the next state. 
\n" + "$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementing the Metropolis-Hastings algorithm in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Alright, now that we now how Metropolis-Hastings works, let's go ahead and implement it.\n", + "All right, now that we know how Metropolis-Hastings works, let's go ahead and implement it.\n", "First, we set the log-probability of the distribution we want to sample from - without normalization constants, as we pretend we don't know them.\n", "Let's work for now with a standard normal distribution:" ] @@ -258,7 +259,7 @@ "metadata": {}, "source": [ "Next, we choose a symmetric proposal distribution.\n", - "In general, including information you have about the distribution you want to sample from in the proposal distribution will lead to better performance of the Metropolis-Hastings algorithm. \n", + "Generally, including information you have about the distribution you want to sample from in the proposal distribution will lead to better performance of the Metropolis-Hastings algorithm. \n", "A naive approach is to just take the current state $x$ and pick a proposal from $\\mathcal{U}(x-\\frac{\\Delta}{2}, x+\\frac{\\Delta}{2})$, that is, we set some step size $\\Delta$ and move left or right a maximum of $\\frac{\\Delta}{2}$ from our current state:" ] }, @@ -346,6 +347,13 @@ " return chain, acceptance_rate" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Testing our Metropolis-Hastings implementation and exploring its behavior" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -381,7 +389,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Alright.\n", + "All right.\n", "So did this work?\n", "We achieved an acceptance rate of around 71% and we have a chain of states.\n", "We should throw away the first few states during which the chain won't have converged to its stationary distribution yet.\n", @@ -445,9 +453,8 @@ "metadata": {}, "source": [ "Now, what's up with the parameters `stepsize` and `n_total`?\n", - "We'll discuss the step size first: it determines how big the random steps are the Markov chain takes.\n", - "If the step size is too large, we will jump around a lot in the tails of the distribution, where probability is low.\n", - "The Metropolis-Hastings sampler will reject most of these moves, meaning that the acceptance rate decreases and convergence is much slower.\n", + "We'll discuss the step size first: it determines how far away a proposal state can be from the current state of the chain. It is thus a parameter of the proposal distribution $q$ and controls how big the random steps are which the Markov chain takes. 
If the step size is too large, the proposal state will often be in the tails of the distribution, where probability is low.\n", + "The Metropolis-Hastings sampler rejects most of these moves, meaning that the acceptance rate decreases and convergence is much slower.\n", "See for yourself:" ] }, @@ -493,9 +500,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Not so cool, right?\n", + "Not as cool, right?\n", "Now you could think the best thing to do is to choose a tiny step size.\n", - "Turns out that this is not too smart, either, because then the Markov chain will explore the probability distribution only very slowly and thus again won't converge as rapidly as with a well-adjusted step size:" + "Turns out that this is not too smart either because then the Markov chain will explore the probability distribution only very slowly and thus again won't converge as rapidly as with a well-adjusted step size:" ] }, { @@ -533,7 +540,7 @@ "source": [ "No matter how you choose the step size parameter, the Markov chain will eventually converge to its stationary distribution.\n", "But it may take a looooong time.\n", - "The time we simulate the Markov chain for is set by the `n_total` parameter - it simply determines how many states we will end up with.\n", + "The time we simulate the Markov chain for is set by the `n_total` parameter - it simply determines how many states of the Markov chain (and thus samples) we'll end up with.\n", "If the chain converges slowly, we need to increase `n_total` in order to give the Markov chain enough time to forget its initial state.\n", "So let's keep the tiny step size and increase the number of samples by increasing `n_total`:" ] }, @@ -587,14 +594,18 @@ "source": [ "With these considerations, I conclude the first blog post of this series.\n", "I hope you now understand the intuition behind the Metropolis-Hastings algorithm, its parameters and why it is an extremely useful tool to sample from non-standard probability distributions you might encounter out there in the wild. \n", - "I highly encourage you to play around with the code in this notebook - this way you will learn how the algorithm behaves in various circumstances and deepen your understanding of it.\n", + "\n", + "I highly encourage you to play around with the code in this notebook - this way, you can learn how the algorithm behaves in various circumstances and deepen your understanding of it.\n", "Go ahead and try out a non-symmetric proposal distribution!\n", "What happens if you don't adjust the acceptance criterion accordingly?\n", "What happens if you try to sample from a bimodal distribution?\n", "Can you think of a way to automatically tune the stepsize?\n", "What are pitfalls here?\n", "Try it out and discover for yourself! \n", - "In the next post, I will discuss the Gibbs sampler - a special case of the Metropolis-Hastings algorithm which allows you to approximately sample from a multivariate distribution by sampling from the conditional distributions." + "\n", + "In my next post, I will discuss the Gibbs sampler - a special case of the Metropolis-Hastings algorithm that allows you to approximately sample from a multivariate distribution by sampling from the conditional distributions.\n", + "\n", + "Thanks for reading—go forward and sample!" ] } ],
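On the closing question of automatically tuning the step size, here is one naive possibility as a hedged, self-contained sketch (not part of the original notebook; the function names `metropolis_chain` and `tune_stepsize`, the uniform proposal and the 50% target acceptance rate are illustrative assumptions): rescale the step size after short pilot runs until the acceptance rate is near a target value.

```python
import numpy as np

def log_prob(x):
    # log-density of a standard normal, up to an additive constant
    return -0.5 * x * x

def metropolis_chain(init, stepsize, n_total):
    # Minimal Metropolis sampler with a uniform proposal of width `stepsize`
    chain = [init]
    n_accepted = 0
    for _ in range(n_total):
        current = chain[-1]
        proposal = current + np.random.uniform(-0.5 * stepsize, 0.5 * stepsize)
        # Metropolis criterion for a symmetric proposal distribution
        if np.log(np.random.uniform()) < log_prob(proposal) - log_prob(current):
            chain.append(proposal)
            n_accepted += 1
        else:
            chain.append(current)
    return np.array(chain), n_accepted / n_total

def tune_stepsize(init, stepsize=1.0, target=0.5, n_pilot=1000, n_rounds=10):
    # Crude automatic tuning: rescale the step size after short pilot runs so
    # that the acceptance rate moves towards the target value.
    for _ in range(n_rounds):
        _, acceptance = metropolis_chain(init, stepsize, n_pilot)
        stepsize *= max(acceptance, 0.05) / target
    return stepsize

stepsize = tune_stepsize(init=2.0)
chain, acceptance_rate = metropolis_chain(2.0, stepsize, n_total=10000)
print(stepsize, acceptance_rate)
```

Note that such adaptation should be confined to a warm-up phase; once you start collecting samples, the step size should stay fixed so that the chain remains a proper, time-homogeneous Markov chain with the desired stationary distribution.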