\chapter{Markov Decision Processes}
\label{chap-mdps}
So far, most of the learning problems we have looked at have been {\em supervised}: for each training input $\ex{x}{i}$, we
are told which value $\ex{y}{i}$ should be the output. From a traditional machine-learning viewpoint, there are two other major groups of learning problems. One is {\em unsupervised} learning, in
which we are given data but no expected outputs; we will look at it later in Chapter \ref{chap-clustering} and Chapter \ref{chap-autoencoders}.
The other is {\em reinforcement learning}\index{reinforcement learning} (RL). Reinforcement learning differs significantly from supervised learning, and we will delve into the details in Chapter \ref{chap-reinforcement}. However, it is worth pointing out one major difference at a very high level: in supervised learning, our goal is to learn a static mapping from inputs to predictions, whereas in RL, the setup requires us to take actions sequentially so as to maximize cumulative reward.
This change in setup necessitates additional mathematical and algorithmic tools. The {\em Markov decision process}\index{Markov decision process} ({\sc mdp}) is precisely such a classical and fundamental tool.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Definition and value functions}
\label{sec_mdps}
Formally, a Markov decision process is $\langle \mathcal S, \mathcal A, T, R,
\gamma\rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and:
\begin{itemize}
\item
$T : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \R$ is
a {\em transition model}\index{transition model}, where
\[T(s, a, s') = Pr(S_t = s'|S_{t - 1} = s,
A_{t - 1} = a)\;\;,\] specifying a conditional probability distribution;
\note{The notation $S_t = s'$ uses a capital letter $S$ to stand for
a random variable, and small letter $s$ to stand for a concrete value.
So $S_t$ here is a random variable that can take on elements of
$\mathcal S$ as values.}
\item
$R: \mathcal S \times \mathcal A \rightarrow \R$ is a {\em reward function}\index{reward function},
where $R(s, a)$ specifies an immediate reward for taking action $a$ when
in state $s$; and
\item $\gamma \in [0, 1]$ is a {\em discount factor}, which we'll
discuss in Section~\ref{sec-discount}.
\end{itemize}
In this class, we assume the rewards are deterministic functions. Further, in this MDP chapter, we assume the state space and action space are finite (in fact, typically small).
\begin{examplebox}
The following description of a simple machine as a Markov decision
process provides a concrete example of an {\sc mdp}.
%
% An agent controls the actions taken, while the environment responds
% with the transition to the next state.
%
The machine has three possible operations ({\em actions}): ``wash'',
``paint'', and ``eject'' (each with a corresponding button). Objects
are put into the machine. Each time you push a button, something is
done to the object. However, it's an old machine, so it's not very
reliable. The machine has a camera inside that can clearly detect what
is going on with the object and will output the state of the object:
``dirty'', ``clean'', ``painted'', or ``ejected''. For each action,
this is what is done to the object:
\noindent {\bf Wash}:
\begin{itemize}
\item If you perform the ``wash'' operation on any object, whether
it's dirty, clean, or painted, it will end up ``clean'' with
probability 0.9 and ``dirty'' otherwise.
\end{itemize}
\noindent {\bf Paint}:
\begin{itemize}
\item If you perform the ``paint'' operation on a clean object, it
will become nicely ``painted'' with probability 0.8. With
probability 0.1, the paint misses but the object stays clean, and
also with probability 0.1, the machine dumps rusty dust all over the
object and it becomes ``dirty''.
\item If you perform the ``paint'' operation on a ``painted'' object,
it stays ``painted'' with probability 1.0.
\item If you perform the ``paint'' operation on a ``dirty'' part, it
stays ``dirty'' with probability 1.0.
\end{itemize}
\noindent {\bf Eject}:
\begin{itemize}
\item If you perform an ``eject'' operation on any part, the part
comes out of the machine and this fun game is over. The part remains
"ejected" regardless of any further action.
\end{itemize}
\noindent
These descriptions specify the transition model $T$, and the
transition function for each action can be depicted as a state machine
diagram. For example, here is the diagram for ``wash'':
%
% As for the state machine diagrams presented above, states are denoted
% by large circles and actions by small black circles. The arc from a
% state to an action is followed by the possible arcs (from the small
% action circle) showing the probability of transition from the source
% state to the indicated destination state.
%
\begin{center}
\begin{tikzpicture}[->]
\node[state] (dirty) {dirty};
\node[state, right of=dirty, xshift=2.5cm] (clean) {clean};
\node[state, below of=dirty, yshift=-2.5cm] (painted) {painted};
\node[state, below of=clean, yshift=-2.5cm] (ejected) {ejected};
\draw (dirty) edge[loop above] node{0.1} (dirty)
(dirty) edge[bend left, above] node{0.9} (clean)
(clean) edge[loop above] node{0.9} (clean)
(clean) edge[bend left, below] node{0.1} (dirty)
(painted) edge[left] node{0.1} (dirty)
(painted) edge[right] node{~0.9} (clean)
(ejected) edge[loop above] node{1.0} (ejected);
\end{tikzpicture}
\end{center}
You get reward +10 for ejecting a painted object, reward 0 for
ejecting a non-painted object, reward 0 for any action on an ``ejected''
object, and reward -3 otherwise. The {\sc mdp} description would be
completed by also specifying a discount factor.
\end{examplebox}
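As a concrete illustration, the machine {\sc mdp} above could be written down in Python.
This is just an illustrative sketch we add here (the names \texttt{STATES}, \texttt{ACTIONS},
\texttt{T}, and \texttt{R} are our own choices, not a required format); we will reuse this
dictionary-style representation in later sketches.
\begin{verbatim}
STATES = ["dirty", "clean", "painted", "ejected"]
ACTIONS = ["wash", "paint", "eject"]

# T[s][a] maps each possible next state s' to Pr(S_t = s' | S_{t-1} = s, A_{t-1} = a).
T = {
    "dirty":   {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"dirty": 1.0},
                "eject": {"ejected": 1.0}},
    "clean":   {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"painted": 0.8, "clean": 0.1, "dirty": 0.1},
                "eject": {"ejected": 1.0}},
    "painted": {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"painted": 1.0},
                "eject": {"ejected": 1.0}},
    "ejected": {"wash":  {"ejected": 1.0},
                "paint": {"ejected": 1.0},
                "eject": {"ejected": 1.0}},
}

# Reward: +10 for ejecting a painted object, 0 for ejecting a non-painted one,
# 0 for any action on an ejected object, and -3 otherwise.
def R(s, a):
    if s == "ejected":
        return 0.0
    if a == "eject":
        return 10.0 if s == "painted" else 0.0
    return -3.0
\end{verbatim}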
A {\it{policy}}\index{policy} is a function $\pi$
that specifies what action to take in each state. The policy is what
we will want to learn; it is akin to the strategy that a player
employs to win a given game. Below, we take just the initial steps
towards this eventual goal. We describe how to evaluate how good a
policy is, first in the {\em finite horizon} case
(Section~\ref{sec-mdp_finite_horizon}) when the total number of
transition steps is finite. In the finite horizon case, we typically denote
the policy as $\pi_h(s)$, where $h\in\mathbb N$ is a non-negative integer
denoting the number of steps remaining and $s \in \mathcal S$ is the current
state. Then we consider the {\em infinite horizon}
case (Section~\ref{sec-mdp_infinite_horizon}), when you
don't know when the game will be over.
% There are important algorithms that can be used to find good
% policies, under certain assumptions. These will be discussed in
% Chapter~\ref{chap-reinforcement}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finite-horizon value functions}
\index{policy!finite horizon}
\label{sec-mdp_finite_horizon}
The goal of a policy is to maximize the expected total reward, averaged over
the stochastic transitions that the domain makes. Let's first
consider the case where there is a finite {\em horizon} $H$,
indicating the total number of steps of interaction that the agent
will have with the {\sc mdp}\note{In the finite-horizon case, we usually
set the discount factor $\gamma$ to 1.}.
We seek to measure the goodness of a
policy. We do so by defining, for a given horizon $h$ and {\sc mdp} policy $\pi_h$,
the ``horizon $h$ {\em value}'' of a state, $V^{h}_\pi(s)$. We
do this by induction on the horizon, which is the {\em number of steps
left to go}.
The base case is when there are no steps remaining, in which case, no
matter what state we're in, the value is 0, so
\begin{equation}
V^0_{\pi}(s) = 0\;\;.
\label{eq:finite_val_0}
\end{equation}
Then, the value of a policy in state $s$ at horizon $h + 1$ is equal
to the reward it will get in state $s$ plus the next state's expected horizon $h$
value, discounted by a factor $\gamma$. So, starting with horizons 1 and 2, and then
moving to the general case, we have:
\begin{align}
V^1_{\pi}(s) & = R(s, \pi_1(s)) + 0 \\
\label{eq:finite_val_1}
% V^2_{\pi}(s) & = R(s, \pi_2(s)) + \sum_{s'}T(s, \pi_2(s), s') R(s', \pi(s')) \\
V^2_{\pi}(s) & = R(s, \pi_2(s)) + \gamma \sum_{s'}T(s, \pi_2(s), s') V^1_{\pi}(s') \\
\vdots
\nonumber \\
V^h_{\pi}(s) & = R(s, \pi_h(s)) + \gamma \sum_{s'}T(s, \pi_h(s), s') V^{h - 1}_{\pi}(s')
\label{eq:finite_value}
\end{align}
The sum over $s'$ is an {\em expectation}\index{expectation}: it
considers all possible next states $s'$, and computes an average of
their $(h-1)$-horizon values, weighted by the probability that the
transition function from state $s$ with the action chosen by the
policy $\pi_h(s)$ assigns to arriving in state $s'$, and discounted by $\gamma$.
\question{What is $\sum_{s'} T(s, a, s')$ for any particular $s$ and $a$?}
\question{Convince yourself that Eqs.~\ref{eq:finite_val_0}
and~\ref{eq:finite_val_1} are special cases of Eq.~\ref{eq:finite_value}.}
Then we can say that a policy $\pi$ is better than policy $\bar{\pi}$ for horizon
$h$ if and only if for all $s \in \mathcal S$,
$V_{\pi}^h(s) \geq V_{\bar{\pi}}^h(s)$ and there exists at least one $s
\in \mathcal S$ such that $V_{\pi}^h(s) > V_{\bar{\pi}}^h(s)$.
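To make the recursion in Eq.~\ref{eq:finite_value} concrete, here is a minimal Python
sketch of finite-horizon policy evaluation. It assumes the dictionary-style \texttt{T},
\texttt{R}, and state list sketched earlier, and a \texttt{policy} indexed by the number
of steps remaining; these are illustrative names, not part of the formal development.
\begin{verbatim}
def finite_horizon_value(T, R, states, policy, H, gamma=1.0):
    # Implements V^h(s) = R(s, pi_h(s)) + gamma * sum_s' T(s, pi_h(s), s') V^(h-1)(s').
    V = {s: 0.0 for s in states}                   # base case: V^0(s) = 0
    for h in range(1, H + 1):
        V_new = {}
        for s in states:
            a = policy[h][s]                       # action with h steps remaining
            expected_next = sum(p * V[s2] for s2, p in T[s][a].items())
            V_new[s] = R(s, a) + gamma * expected_next
        V = V_new
    return V                                       # V^H under the given policy
\end{verbatim}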
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Infinite-horizon value functions}
\label{sec-mdp_infinite_horizon}
\label{sec-discount}
\index{policy!infinite horizon}
More typically, the actual finite horizon is not known, i.e., you
don't know when the game will be over! This is called the {\em
infinite horizon} version of the problem. How does one evaluate the
goodness of a policy in the infinite horizon case?
If we tried to simply take our definitions above and use them for an
infinite horizon, we could get in trouble. Imagine we get a reward of
1 at each step under one policy and a reward of 2 at each step under a
different policy. Then the total reward under each policy grows without
bound as the number of steps increases, becoming infinite in the limit.
Even though it seems intuitive that the second policy should
be better, we can't justify that by saying $\infty < \infty$.
One standard approach to deal with this problem is to consider the
\emph{discounted} infinite horizon\index{policy!infinite horizon!discount factor}.
We will generalize from the finite-horizon case by adding a discount factor.
In the finite-horizon case, we valued a
policy based on an expected finite-horizon value:
\begin{equation}
\mathbb{E}\left[\sum_{t = 0}^{h-1} \gamma^t R_t \mid \pi, s_0\right]\;\;,
\label{eq:exp_finite}
\end{equation}
where $R_t$ is the reward received at time $t$.
\begin{examplebox}
What is $\mathbb{E}\left[ \cdot \right ]$? This mathematical
notation indicates an {\em expectation}, i.e., an average
taken over all the random possibilities which may occur for the
argument. Here, the expectation is taken over the {\em conditional
probability} $Pr(R_t = r \mid \pi, s_0)$, where $R_t$ is the random
variable for the reward, subject to the policy being $\pi$ and the
initial state being $s_0$. Since $\pi$ is a function, this notation is
shorthand for conditioning on all of the random variables implied by
policy $\pi$ and the stochastic transitions of the MDP.
\bigskip
A very important point is that $R(s,a)$ is always deterministic (in this class) for
any given $s$ and $a$. Here $R_t$ denotes the reward received at time
step $t$; this $R_t$ is a random variable because
the state we're in at step $t$ is itself a random variable, due to the prior
stochastic state transitions up to (but not including) step $t$ and the prior
(deterministic) actions dictated by policy $\pi$.
\end{examplebox}
Now, for the infinite-horizon case, we select a discount factor $0 \leq \gamma < 1$,
and evaluate a policy based on its expected {\it{infinite horizon discounted value}}:
\begin{equation}
\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^tR_t \mid \pi, s_0\right]
= \mathbb{E}\left[R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots \mid \pi, s_0\right] \;\;.
\label{eq:exp_infinite}
\end{equation}
Note that the $t$ indices here are not the number of steps to go, but
actually the number of steps forward from the starting state (there is
no sensible notion of ``steps to go'' in the infinite horizon case).
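As a quick sanity check on why discounting resolves the divergence problem above: if the
rewards are bounded in magnitude by some constant, say $\lvert R_t \rvert \leq R_{\max}$
(a bound we introduce here just for this calculation), and $\gamma < 1$, then the
discounted sum is always finite:
\[
\left\lvert \sum_{t = 0}^{\infty} \gamma^t R_t \right\rvert
\;\leq\; \sum_{t = 0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}\;\;,
\]
so the discounted values of different policies can be meaningfully compared.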
\begin{examplebox}
Eqs.~\ref{eq:exp_finite} and~\ref{eq:exp_infinite} are a conceptual
stepping stone. Our main objective is to get to
Eq.~\ref{eq:inifite_horiz_value}, which can also be viewed
as including $\gamma$ in Eq.~\ref{eq:finite_value}, with
the appropriate definition of the infinite-horizon value.
\end{examplebox}
There are two good intuitive motivations for discounting. One is
related to economic theory and the present value of money: you'd
generally rather have some money today than that same amount of money
next week (because you could use it now or invest it). The other is
to think of the whole process terminating, with probability $1-\gamma$
on each step of the interaction\note{At every
step, your expected future lifetime, given that you have survived
until now, is $1 / (1 - \gamma)$.}. The discounted value in Eq.~\ref{eq:exp_infinite} is then the expected amount of
(undiscounted) reward the agent would gain under this terminating model.
\question{Verify this fact: if, on every day you wake up, there is a
probability of $1 - \gamma$ that today will be your last day, then
your expected lifetime is $1 / (1 - \gamma)$ days. }
Let us now evaluate a policy in terms of the expected discounted
infinite-horizon value that the agent will get in the {\sc mdp} if it
executes that policy. We define the infinite-horizon value of a state
$s$ under policy $\pi$ as
\begin{equation}
V_{\pi}(s) = \mathbb{E}[R_0 + \gamma R_1 + \gamma^2 R_2 +
\dots \mid \pi, S_0 = s] = \mathbb{E}[R_0 + \gamma(R_1 + \gamma(R_2 + \gamma
\dots)) \mid \pi, S_0 = s] \;\;.
\end{equation}
Because the expectation of a linear combination of random variables is
the linear combination of the expectations, we have
\begin{align}
V_{\pi}(s) & = \mathbb{E}[R_0 \mid \pi, S_0 = s] + \gamma \mathbb{E}[
R_1 + \gamma(R_2 + \gamma \dots) \mid \pi, S_0 = s]
\nonumber
\\
& = R(s, \pi(s)) + \gamma\sum_{s'}T(s, \pi(s), s')V_{\pi}(s') \; .
\label{eq:inifite_horiz_value}
\end{align}\note{This is {\em so} cool! In a discounted model, if you find that
you survived this round and landed in some state $s'$, then you have
the same expected future lifetime as you did before. So the value
function\index{value function} that is relevant in that state is
exactly the same one as in state $s$.}
Eq.~\ref{eq:inifite_horiz_value} is known as the {\em Bellman equation}\index{Bellman equation},
which breaks the value function down into the immediate reward and the (discounted)
value function at the next state.
You could write down one of these equations for each of the $n =
|\mathcal S|$ states. There are $n$ unknowns $V_{\pi}(s)$. These are
linear equations, and standard software (e.g., using Gaussian
elimination or other linear algebraic methods) will, in most cases,
enable us to find the value of each state under this policy.
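For instance, if the transition probabilities under the policy are arranged as a matrix,
this linear system can be solved directly. The short sketch below assumes \texttt{T\_pi}
is the $n \times n$ matrix with entries $T(s_i, \pi(s_i), s_j)$ and \texttt{R\_pi} is the
length-$n$ vector with entries $R(s_i, \pi(s_i))$; both names are illustrative.
\begin{verbatim}
import numpy as np

def evaluate_policy(T_pi, R_pi, gamma):
    # Solve the linear Bellman system (I - gamma * T_pi) V = R_pi for V_pi.
    # For gamma < 1 the matrix (I - gamma * T_pi) is invertible.
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
\end{verbatim}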
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Finding policies for MDPs}
\label{sec-finding_mdp_policies}
Given an {\sc mdp}, our goal is typically to find a policy that is
optimal in the sense that it gets as much total reward as possible, in
expectation over the stochastic transitions that the domain makes. We
build on what we have learned about evaluating the goodness of a
policy (Sections~\ref{sec-mdp_finite_horizon}
and~\ref{sec-mdp_infinite_horizon}), and find optimal policies for the
finite horizon case (Section~\ref{sec-mdp_finite_horizon_optimal}),
then the infinite horizon case
(Section~\ref{sec-mdp_infinite_horizon_optimal}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finding optimal finite-horizon policies}
\label{sec-mdp_finite_horizon_optimal}
How can we go about finding an optimal policy for an {\sc mdp}?
\index{policy!optimal policy} We could imagine enumerating all
possible policies and calculating their value functions\index{value
function!calculating value function} as in the previous section and
picking the best one -- but that's too much work!
The first observation to make is that, in a finite-horizon problem,
the best action to take depends on the current state, but also on the
horizon: imagine that you are in a situation where you could reach a
state with reward 5 in one step or a state with reward 100 in two
steps. If you have at least two steps to go, then you'd move toward
the reward 100 state, but if you only have one step left to go, you should
go in the direction that will allow you to gain 5!
One way to find an optimal policy is to compute an optimal {\em action-value function}, $Q$\index{Q-function}\index{optimal
action-value function}. For the finite-horizon case, we define
$Q^h(s, a)$ to be the expected value of
\begin{itemize}
\item starting in state $s$,
\item executing action $a$, and
\item continuing for $h - 1$ more steps executing an optimal policy
for the appropriate horizon on each step.
\end{itemize}
Similar to our definition of $V^h$ for evaluating a policy, we define
the $Q^h$ function recursively according to the horizon. The only
difference is that, on each step with horizon $h$, rather than
selecting an action specified by a given policy, we select the value
of $a$ that will maximize the expected $Q^h$ value of the next state.
\begin{align}
Q^0(s, a) & = 0 \\
Q^1(s, a) & = R(s, a) + 0 \\
% Q^2(s, a) & = R(s, a) + \sum_{s'}T(s, a, s') \max_{a'} R(s', a') \\
Q^2(s, a) & = R(s, a) + \gamma \sum_{s'}T(s, a, s') \max_{a'} Q^1(s', a') \\
\vdots
\nonumber \\
Q^h(s, a) & = R(s, a) + \gamma \sum_{s'}T(s, a, s') \max_{a'} Q^{h - 1}(s', a')
\end{align}
where $(s',a')$ denotes the state/action pair at the next time step. We can solve for the values of $Q^h$ with a simple recursive algorithm
called {\it{finite-horizon value iteration}} that just computes $Q^h$ starting
from horizon 0 and working up to the desired horizon
$H$ (that is, backward in time). Given $Q^h$, an optimal $\pi_h^*$ can be found as follows:
\begin{equation}
\pi_h^*(s) = \arg\max_{a}Q^h(s, a) \;\;,
\end{equation}
which gives the \emph{immediate} best action(s) to take when there are $h$ steps left; then $\pi_{h-1}^*(s)$ gives the best action(s) when there are $(h-1)$ steps left, and so on. In the case where there are multiple best actions, we can typically break ties randomly.
Additionally, it is worth noting that in order for such an optimal policy to be
computed, we assume that the reward function $R(s,a)$ is bounded on the set of
all possible (state, action) pairs. Furthermore, we will assume that the set of
all possible actions is finite.
\question{The optimal value function is unique, but the optimal policy
is not. Think of a situation in which there is more than one
optimal policy.}
\begin{examplebox} {\bf Dynamic programming}
(somewhat counter-intuitively, dynamic programming is neither really
``dynamic'' nor a type of ``programming'' as we typically understand
it) is a technique for designing efficient algorithms. Most methods
for solving MDPs or computing value functions rely on dynamic
programming to be efficient. The {\em principle of dynamic
programming}\index{dynamic programming} is to compute and store
the solutions to simple sub-problems that can be re-used later in
the computation. It is a very important tool in our algorithmic
toolbox.
Let's consider what would happen if we tried to compute $Q^4(s,
a)$ for all $(s, a)$ by directly using the definition:
\begin{itemize}
\item To compute $Q^4(s_i, a_j)$ for any one $(s_i, a_j)$, we would
need to compute $Q^3(s, a)$ for all $(s, a)$
pairs.
\item To compute $Q^3(s_i, a_j)$ for any one $(s_i, a_j)$, we'd need to
compute $Q^2(s, a)$ for all $(s, a)$ pairs.
\item To compute $Q^2(s_i, a_j)$ for any one $(s_i, a_j)$, we'd
need to compute $Q^1(s, a)$ for all $(s, a)$ pairs.
\item Luckily, those are just our $R(s, a)$ values.
\end{itemize}
So, if we have $n$ states and $m$ actions, this is $O((mn)^3)$
work --- that seems like way too much, especially as the horizon
increases! But observe that we really only have $mnh$ values that
need to be computed: $Q^h(s, a)$ for all $h, s, a$. If we start with
$h=1$, compute and store those values, and then reuse the stored
$Q^{h-1}(s, a)$ values to compute the $Q^h(s, a)$ values, we can do
all this computation in time $O(mnh)$, which is much better!
\end{examplebox}
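Putting these pieces together, here is a Python sketch of finite-horizon value iteration
that stores each $Q^h$ and reuses the stored $Q^{h-1}$ values, in the spirit of the
dynamic-programming discussion above. It assumes the dictionary-style \texttt{T},
\texttt{R}, states, and actions sketched earlier (illustrative names only).
\begin{verbatim}
def finite_horizon_value_iteration(T, R, states, actions, H, gamma=1.0):
    Q = [{(s, a): 0.0 for s in states for a in actions}]        # Q^0 = 0
    for h in range(1, H + 1):
        Qh = {}
        for s in states:
            for a in actions:
                expected = sum(p * max(Q[h - 1][(s2, a2)] for a2 in actions)
                               for s2, p in T[s][a].items())
                Qh[(s, a)] = R(s, a) + gamma * expected
        Q.append(Qh)
    # Optimal action with h steps to go: pi*_h(s) = argmax_a Q^h(s, a).
    policy = [None] + [
        {s: max(actions, key=lambda a: Q[h][(s, a)]) for s in states}
        for h in range(1, H + 1)
    ]
    return Q, policy
\end{verbatim}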
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finding optimal infinite-horizon policies}
\label{sec-mdp_infinite_horizon_optimal}
In contrast to the finite-horizon case, the best way of behaving in an
infinite-horizon discounted {\sc mdp} is not time-dependent. That is,
the decisions you make at time $t = 0$, looking forward to infinity,
will be the same decisions that you make at time $t = T$ for any positive
$T$, also looking forward to infinity.
An important theorem about {\sc mdp}s is: in the infinite-horizon
case, there exists a stationary\index{policy!stationary policy}
\note{Stationary means that it doesn't change over time; in contrast,
the optimal policy in a finite-horizon {\sc mdp} is {\em
non-stationary.}} optimal policy $\pi^*$ (there may be more than
one) such that for all $s \in \mathcal S$ and all other policies
$\pi$, we have
\begin{equation}
V_{\pi^*}(s) \ge V_{\pi}(s) \;\;.
\end{equation}
% Let $n = |\mathcal S|$ andd $m = |\mathcal A|$. Algorithm for finding $\pi^*$:
% \begin{itemize}
% \item
% enumerate and test, complexity is $O(m^n)$
% \item
% linear programming, complexity is $O(\text{poly}(n, m b))$ where $b$ is the number of bits per element of $T, R$
% \item
% policy iteration, complexity is $O(\text{poly}(n, m, \text{bits}(\gamma)))$, requires solving lots of $n \times n$ systems
% \item
% {\underline{value iteration}}, complexity is $O\left(\text{poly}(n, m, b, \frac{1}{1 - \gamma}\right)$
% \end{itemize}
% The latter is easy to implement and the foundation for many reinforcement-learning methods.
There are many methods for finding an optimal policy for an {\sc mdp}.
We have already seen the finite-horizon value iteration case. Here we
will study a very popular and useful method for the infinite-horizon
case, {\em infinite-horizon value iteration}. It is also important to
us, because it is the basis of many {\em reinforcement-learning} methods.
We will again assume that the reward function $R(s,a)$ is bounded on the
set of all possible (state, action) pairs and additionally that the number
of actions in the action space is finite.
Define $Q(s, a)$ to be the expected infinite-horizon discounted
value of being in state $s$, executing action $a$, and executing
an optimal policy $\pi^*$ thereafter. Using similar reasoning to the
recursive definition of $V_\pi,$ we can express this value
recursively as
\begin{equation}
Q(s, a) = R(s, a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q(s',
a') \;\;.
\end{equation}
This is also a set of equations, one for each $(s, a)$ pair. This
time, though, they are not linear (due to the $\max$ operation), and so
they are not easy to solve. But there is a theorem that says they
have a unique solution!
Once we know the optimal action-value function, then we can extract an
optimal policy $\pi^*$ as
\begin{equation}
\pi^*(s) = \arg\max_{a}Q(s, a) \;\;.
\end{equation}
\note{As in the finite-horizon case, there may be more than one optimal
policy in the infinite-horizon case.}
We can iteratively solve for the $Q^*$ values with the
infinite-horizon value iteration algorithm, shown below:
\begin{codebox}
\Procname{$\proc{Infinite-Horizon-Value-Iteration}(\mathcal S, \mathcal A, T, R, \gamma, \epsilon)$}
\li \For $s \in \mathcal{S}, a \in \mathcal{A}:$
\Do
\li $Q_{\text{old}}(s, a) = 0$
\End
\li \While not converged:
\Do
\li \For $s \in \mathcal{S}, a \in \mathcal{A}:$
\Do
\li $Q_{\text{new}}(s, a) = R(s, a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q_{\text{old}}(s', a')$
\End
\li \If $\max_{s, a}\lvert Q_{\text{old}}(s, a) - Q_{\text{new}}(s, a)\rvert < \epsilon:$
\Do
\li return $Q_{\text{new}}$
\End
\li $Q_{\text{old}} = Q_{\text{new}}$
\End
\end{codebox}
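The pseudocode above could be rendered in Python roughly as follows, again assuming the
dictionary-style \texttt{T}, \texttt{R}, states, and actions sketched earlier; this is a
sketch under those assumptions, not a reference implementation.
\begin{verbatim}
def infinite_horizon_value_iteration(T, R, states, actions, gamma, eps):
    Q_old = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                expected = sum(p * max(Q_old[(s2, a2)] for a2 in actions)
                               for s2, p in T[s][a].items())
                Q_new[(s, a)] = R(s, a) + gamma * expected
        if max(abs(Q_old[k] - Q_new[k]) for k in Q_new) < eps:
            return Q_new
        Q_old = Q_new

def greedy_policy(Q, states, actions):
    # Extract a (possibly non-unique) optimal policy: pi(s) = argmax_a Q(s, a).
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
\end{verbatim}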
\paragraph*{Theory}
There are a lot of nice theoretical results about infinite-horizon value iteration.
For some given (not necessarily optimal) $Q$ function, define
$\pi_{Q}(s) = \arg\max_{a}Q(s, a)$.
\begin{itemize}
\item After executing infinite-horizon value
iteration with convergence hyper-parameter $\epsilon$,
\note{Note the new
notation! Given two functions $f$ and $f'$, we
write $\lVert f - f' \rVert_\text{max}$ to mean $\max_x \lvert f(x)
- f'(x)\rvert$. It measures the maximum absolute
disagreement between the two functions at any input $x$.}
\begin{equation}
\lVert V_{\pi_{Q_{\text{new}}}} - V_{\pi^*} \rVert_{\text{max}} < \epsilon \;\; .
\end{equation}
%
\item
There is a value of $\epsilon$ such that
\begin{equation}
\lVert Q_{\text{old}} - Q_{\text{new}} \rVert_{\text{max}} <
\epsilon \Longrightarrow \pi_{Q_{\text{new}}} = \pi^*
\end{equation}
\item As the algorithm executes,
$\lVert V_{\pi_{Q_{\text{new}}}} - V_{\pi^*} \rVert_{\text{max}}$ decreases
monotonically on each iteration.
\item The algorithm can be executed asynchronously, in parallel: as
long as all $(s, a)$ pairs are updated infinitely often in an
infinite run, it still converges to the optimal value.
\note{This is very important for reinforcement learning.}
\end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: