\chapter{Markov Decision Processes}
\label{chap-mdps}
So far, most of the learning problems we have looked at have been {\em supervised}: for each training input $\ex{x}{i}$, we
are told which value $\ex{y}{i}$ should be the output. From a traditional machine-learning viewpoint, there are two other major groups of learning problems. One is {\em unsupervised} learning, in
which we are given data but no expected outputs; we will look at it later in Chapter \ref{chap-clustering} and Chapter \ref{chap-autoencoders}.
The other is {\em reinforcement learning}\index{reinforcement learning} (RL). Reinforcement learning differs significantly from supervised learning, and we will delve into the details in Chapter \ref{chap-reinforcement}. However, it is worth pointing out one major difference at a very high level: in supervised learning, our goal is to learn a static mapping from inputs to predictions, whereas in RL, the setup requires us to take actions sequentially so as to maximize cumulative reward.
This change in setup necessitates additional mathematical and algorithmic tools. The {\em Markov decision process}\index{Markov decision process} ({\sc mdp}) is precisely such a classical and fundamental tool.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Definition and value functions}
\label{sec_mdps}
Formally, a Markov decision process is $\langle \mathcal S, \mathcal A, T, R,
\gamma\rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and:
\begin{itemize}
\item
$T : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \R$ is
a {\em transition model}\index{transition model}, where
\[T(s, a, s') = Pr(S_t = s'|S_{t - 1} = s,
A_{t - 1} = a)\;\;,\] specifying a conditional probability distribution;
\note{The notation $S_t = s'$ uses a capital letter $S$ to stand for
a random variable, and small letter $s$ to stand for a concrete value.
So $S_t$ here is a random variable that can take on elements of
$\mathcal S$ as values.}
\item
$R: \mathcal S \times \mathcal A \rightarrow \R$ is a {\em reward function}\index{reward function},
where $R(s, a)$ specifies an immediate reward for taking action $a$ when
in state $s$; and
\item $\gamma \in [0, 1]$ is a {\em discount factor}, which we'll
discuss in Section~\ref{sec-discount}.
\end{itemize}
In this class, we assume the rewards are deterministic functions. Further, in this MDP chapter, we assume the state space and action space are finite (in fact, typically small).
\begin{examplebox}
The following description of a simple machine as a Markov decision
process provides a concrete example of an {\sc mdp}.
%
% An agent controls the actions taken, while the environment responds
% with the transition to the next state.
%
The machine has three possible operations ({\em actions}): ``wash'',
``paint'', and ``eject'' (each with a corresponding button). Objects
are put into the machine. Each time you push a button, something is
done to the object. However, it's an old machine, so it's not very
reliable. The machine has a camera inside that can clearly detect what
is going on with the object and will output the state of the object:
``dirty'', ``clean'', ``painted'', or ``ejected''. For each action,
this is what is done to the object:
\noindent {\bf Wash}:
\begin{itemize}
\item If you perform the ``wash'' operation on any object, whether
it's dirty, clean, or painted, it will end up ``clean'' with
probability 0.9 and ``dirty'' otherwise.
\end{itemize}
\noindent {\bf Paint}:
\begin{itemize}
\item If you perform the ``paint'' operation on a clean object, it
will become nicely ``painted'' with probability 0.8. With
probability 0.1, the paint misses but the object stays clean, and
also with probability 0.1, the machine dumps rusty dust all over the
object and it becomes ``dirty''.
\item If you perform the ``paint'' operation on a ``painted'' object,
it stays ``painted'' with probability 1.0.
\item If you perform the ``paint'' operation on a ``dirty'' part, it
stays ``dirty'' with probability 1.0.
\end{itemize}
\noindent {\bf Eject}:
\begin{itemize}
\item If you perform an ``eject'' operation on any part, the part
comes out of the machine and this fun game is over. The part remains
"ejected" regardless of any further action.
\end{itemize}
\noindent
These descriptions specify the transition model $T$, and the
transition function for each action can be depicted as a state machine
diagram. For example, here is the diagram for ``wash'':
%
% As for the state machine diagrams presented above, states are denoted
% by large circles and actions by small black circles. The arc from a
% state to an action is followed by the possible arcs (from the small
% action circle) showing the probability of transition from the source
% state to the indicated destination state.
%
\begin{center}
\begin{tikzpicture}[->]
\node[state] (dirty) {dirty};
\node[state, right of=dirty, xshift=2.5cm] (clean) {clean};
\node[state, below of=dirty, yshift=-2.5cm] (painted) {painted};
\node[state, below of=clean, yshift=-2.5cm] (ejected) {ejected};
\draw (dirty) edge[loop above] node{0.1} (dirty)
(dirty) edge[bend left, above] node{0.9} (clean)
(clean) edge[loop above] node{0.9} (clean)
(clean) edge[bend left, below] node{0.1} (dirty)
(painted) edge[left] node{0.1} (dirty)
(painted) edge[right] node{~0.9} (clean)
(ejected) edge[loop above] node{1.0} (ejected);
\end{tikzpicture}
\end{center}
You get reward +10 for ejecting a painted object, reward 0 for
ejecting a non-painted object, reward 0 for any action on an ``ejected''
object, and reward -3 otherwise. The {\sc mdp} description would be
completed by also specifying a discount factor.
\end{examplebox}
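As a concrete illustration, the machine {\sc mdp} above could be written down in Python.
This is just an illustrative sketch we add here (the names \texttt{STATES}, \texttt{ACTIONS},
\texttt{T}, and \texttt{R} are our own choices, not a required format); we will reuse this
dictionary-style representation in later sketches.
\begin{verbatim}
STATES = ["dirty", "clean", "painted", "ejected"]
ACTIONS = ["wash", "paint", "eject"]

# T[s][a] maps each possible next state s' to Pr(S_t = s' | S_{t-1} = s, A_{t-1} = a).
T = {
    "dirty":   {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"dirty": 1.0},
                "eject": {"ejected": 1.0}},
    "clean":   {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"painted": 0.8, "clean": 0.1, "dirty": 0.1},
                "eject": {"ejected": 1.0}},
    "painted": {"wash":  {"clean": 0.9, "dirty": 0.1},
                "paint": {"painted": 1.0},
                "eject": {"ejected": 1.0}},
    "ejected": {"wash":  {"ejected": 1.0},
                "paint": {"ejected": 1.0},
                "eject": {"ejected": 1.0}},
}

# Reward: +10 for ejecting a painted object, 0 for ejecting a non-painted one,
# 0 for any action on an ejected object, and -3 otherwise.
def R(s, a):
    if s == "ejected":
        return 0.0
    if a == "eject":
        return 10.0 if s == "painted" else 0.0
    return -3.0
\end{verbatim}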
A {\it{policy}}\index{policy} is a function $\pi$
that specifies what action to take in each state. The policy is what
we will want to learn; it is akin to the strategy that a player
employs to win a given game. Below, we take just the initial steps
towards this eventual goal. We describe how to evaluate how good a
policy is, first in the {\em finite horizon} case
(Section~\ref{sec-mdp_finite_horizon}) when the total number of
transition steps is finite. In the finite horizon case, we typically denote
the policy as $\pi_h(s)$, where $h\in\mathbb N$ is a non-negative integer
denoting the number of steps remaining and $s \in \mathcal S$ is the current
state. Then we consider the {\em infinite horizon}
case (Section~\ref{sec-mdp_infinite_horizon}), when you
don't know when the game will be over.
% There are important algorithms that can be used to find good
% policies, under certain assumptions. These will be discussed in
% Chapter~\ref{chap-reinforcement}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finite-horizon value functions}
\index{policy!finite horizon}
\label{sec-mdp_finite_horizon}
The goal of a policy is to maximize the expected total reward, averaged over
the stochastic transitions that the domain makes. Let's first
consider the case where there is a finite {\em horizon} $H$,
indicating the total number of steps of interaction that the agent
will have with the {\sc mdp}\note{In the finite-horizon case, we usually
set the discount factor $\gamma$ to 1.}.
We seek to measure the goodness of a
policy. We do so by defining, for a given horizon $h$ and {\sc mdp} policy $\pi_h$,
the ``horizon $h$ {\em value}'' of a state, $V^{h}_\pi(s)$. We
do this by induction on the horizon, which is the {\em number of steps
left to go}.
The base case is when there are no steps remaining, in which case, no
matter what state we're in, the value is 0, so
\begin{equation}
V^0_{\pi}(s) = 0\;\;.
\label{eq:finite_val_0}
\end{equation}
Then, the value of a policy in state $s$ at horizon $h + 1$ is equal
to the reward it will get in state $s$ plus the next state's expected horizon $h$
value, discounted by a factor $\gamma$. So, starting with horizons 1 and 2, and then
moving to the general case, we have:
\begin{align}
V^1_{\pi}(s) & = R(s, \pi_1(s)) + 0 \\
\label{eq:finite_val_1}
% V^2_{\pi}(s) & = R(s, \pi_2(s)) + \sum_{s'}T(s, \pi_2(s), s') R(s', \pi(s')) \\
V^2_{\pi}(s) & = R(s, \pi_2(s)) + \gamma \sum_{s'}T(s, \pi_2(s), s') V^1_{\pi}(s') \\
\vdots
\nonumber \\
V^h_{\pi}(s) & = R(s, \pi_h(s)) + \gamma \sum_{s'}T(s, \pi_h(s), s') V^{h - 1}_{\pi}(s')
\label{eq:finite_value}
\end{align}
The sum over $s'$ is an {\em expectation}\index{expectation}: it
considers all possible next states $s'$, and computes an average of
their $(h-1)$-horizon values, weighted by the probability that the
transition function from state $s$ with the action chosen by the
policy $\pi_h(s)$ assigns to arriving in state $s'$, and discounted by $\gamma$.
\question{What is $\sum_{s'} T(s, a, s')$ for any particular $s$ and $a$?}
\question{Convince yourself that Eqs.~\ref{eq:finite_val_0}
and~\ref{eq:finite_val_1} are special cases of Eq.~\ref{eq:finite_value}.}
Then we can say that a policy $\pi$ is better than policy $\bar{\pi}$ for horizon
$h$ if and only if for all $s \in \mathcal S$,
$V_{\pi}^h(s) \geq V_{\bar{\pi}}^h(s)$ and there exists at least one $s
\in \mathcal S$ such that $V_{\pi}^h(s) > V_{\bar{\pi}}^h(s)$.
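To make the recursion in Eq.~\ref{eq:finite_value} concrete, here is a minimal Python
sketch of finite-horizon policy evaluation. It assumes the dictionary-style \texttt{T},
\texttt{R}, and state list sketched earlier, and a \texttt{policy} indexed by the number
of steps remaining; these are illustrative names, not part of the formal development.
\begin{verbatim}
def finite_horizon_value(T, R, states, policy, H, gamma=1.0):
    # Implements V^h(s) = R(s, pi_h(s)) + gamma * sum_s' T(s, pi_h(s), s') V^(h-1)(s').
    V = {s: 0.0 for s in states}                   # base case: V^0(s) = 0
    for h in range(1, H + 1):
        V_new = {}
        for s in states:
            a = policy[h][s]                       # action with h steps remaining
            expected_next = sum(p * V[s2] for s2, p in T[s][a].items())
            V_new[s] = R(s, a) + gamma * expected_next
        V = V_new
    return V                                       # V^H under the given policy
\end{verbatim}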
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Infinite-horizon value functions}
\label{sec-mdp_infinite_horizon}
\label{sec-discount}
\index{policy!infinite horizon}
More typically, the actual finite horizon is not known, i.e., you
don't know when the game will be over! This is called the {\em
infinite horizon} version of the problem. How does one evaluate the
goodness of a policy in the infinite horizon case?
If we tried to simply take our definitions above and use them for an
infinite horizon, we could get in trouble. Imagine we get a reward of
1 at each step under one policy and a reward of 2 at each step under a
different policy. Then the total reward under each policy grows without
bound as the number of steps increases, becoming infinite in the limit.
Even though it seems intuitive that the second policy should
be better, we can't justify that by saying $\infty < \infty$.
One standard approach to deal with this problem is to consider the
\emph{discounted} infinite horizon\index{policy!infinite horizon!discount factor}.
We will generalize from the finite-horizon case by adding a discount factor.
In the finite-horizon case, we valued a
policy based on an expected finite-horizon value:
\begin{equation}
\mathbb{E}\left[\sum_{t = 0}^{h-1} \gamma^t R_t \mid \pi, s_0\right]\;\;,
\label{eq:exp_finite}
\end{equation}
where $R_t$ is the reward received at time $t$.
\begin{examplebox}
What is $\mathbb{E}\left[ \cdot \right ]$? This mathematical
notation indicates an {\em expectation}, i.e., an average
taken over all the random possibilities which may occur for the
argument. Here, the expectation is taken over the {\em conditional
probability} $Pr(R_t = r \mid \pi, s_0)$, where $R_t$ is the random
variable for the reward, subject to the policy being $\pi$ and the
initial state being $s_0$. Since $\pi$ is a function, this notation is
shorthand for conditioning on all of the random variables implied by
policy $\pi$ and the stochastic transitions of the MDP.
\bigskip
A very important point is that $R(s,a)$ is always deterministic (in this class) for
any given $s$ and $a$. Here $R_t$ denotes the reward received at time
step $t$; this $R_t$ is a random variable because
the state we're in at step $t$ is itself a random variable, due to the prior
stochastic state transitions up to (but not including) step $t$ and the prior
(deterministic) actions dictated by policy $\pi$.
\end{examplebox}
Now, for the infinite-horizon case, we select a discount factor $0 \leq \gamma < 1$,
and evaluate a policy based on its expected {\it{infinite horizon discounted value}}:
\begin{equation}
\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^tR_t \mid \pi, s_0\right]
= \mathbb{E}\left[R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots \mid \pi, s_0\right] \;\;.
\label{eq:exp_infinite}
\end{equation}
Note that the $t$ indices here are not the number of steps to go, but
actually the number of steps forward from the starting state (there is
no sensible notion of ``steps to go'' in the infinite horizon case).
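As a quick sanity check on why discounting resolves the divergence problem above: if the
rewards are bounded in magnitude by some constant, say $\lvert R_t \rvert \leq R_{\max}$
(a bound we introduce here just for this calculation), and $\gamma < 1$, then the
discounted sum is always finite:
\[
\left\lvert \sum_{t = 0}^{\infty} \gamma^t R_t \right\rvert
\;\leq\; \sum_{t = 0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}\;\;,
\]
so the discounted values of different policies can be meaningfully compared.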
\begin{examplebox}
Eqs.~\ref{eq:exp_finite} and~\ref{eq:exp_infinite} are a conceptual
stepping stone. Our main objective is to get to
Eq.~\ref{eq:inifite_horiz_value}, which can also be viewed
as including $\gamma$ in Eq.~\ref{eq:finite_value}, with
the appropriate definition of the infinite-horizon value.
\end{examplebox}
There are two good intuitive motivations for discounting. One is
related to economic theory and the present value of money: you'd
generally rather have some money today than that same amount of money
next week (because you could use it now or invest it). The other is
to think of the whole process terminating, with probability $1-\gamma$
on each step of the interaction\note{At every
step, your expected future lifetime, given that you have survived
until now, is $1 / (1 - \gamma)$.}. The discounted value in Eq.~\ref{eq:exp_infinite} is then the expected amount of
(undiscounted) reward the agent would gain under this terminating model.
\question{Verify this fact: if, on every day you wake up, there is a
probability of $1 - \gamma$ that today will be your last day, then
your expected lifetime is $1 / (1 - \gamma)$ days. }
Let us now evaluate a policy in terms of the expected discounted
infinite-horizon value that the agent will get in the {\sc mdp} if it
executes that policy. We define the infinite-horizon value of a state
$s$ under policy $\pi$ as
\begin{equation}
V_{\pi}(s) = \mathbb{E}[R_0 + \gamma R_1 + \gamma^2 R_2 +
\dots \mid \pi, S_0 = s] = \mathbb{E}[R_0 + \gamma(R_1 + \gamma(R_2 + \gamma
\dots)) \mid \pi, S_0 = s] \;\;.
\end{equation}
Because the expectation of a linear combination of random variables is
the linear combination of the expectations, we have
\begin{align}
V_{\pi}(s) & = \mathbb{E}[R_0 \mid \pi, S_0 = s] + \gamma \mathbb{E}[
R_1 + \gamma(R_2 + \gamma \dots) \mid \pi, S_0 = s]
\nonumber
\\
& = R(s, \pi(s)) + \gamma\sum_{s'}T(s, \pi(s), s')V_{\pi}(s') \; .
\label{eq:inifite_horiz_value}
\end{align}\note{This is {\em so} cool! In a discounted model, if you find that
you survived this round and landed in some state $s'$, then you have
the same expected future lifetime as you did before. So the value
function\index{value function} that is relevant in that state is
exactly the same one as in state $s$.}
Eq.~\ref{eq:inifite_horiz_value} is known as the {\em Bellman equation}\index{Bellman equation},
which breaks the value function down into the immediate reward and the (discounted)
value function at the next state.
You could write down one of these equations for each of the $n =
|\mathcal S|$ states. There are $n$ unknowns $V_{\pi}(s)$. These are
linear equations, and standard software (e.g., using Gaussian
elimination or other linear algebraic methods) will, in most cases,
enable us to find the value of each state under this policy.
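For instance, if the transition probabilities under the policy are arranged as a matrix,
this linear system can be solved directly. The short sketch below assumes \texttt{T\_pi}
is the $n \times n$ matrix with entries $T(s_i, \pi(s_i), s_j)$ and \texttt{R\_pi} is the
length-$n$ vector with entries $R(s_i, \pi(s_i))$; both names are illustrative.
\begin{verbatim}
import numpy as np

def evaluate_policy(T_pi, R_pi, gamma):
    # Solve the linear Bellman system (I - gamma * T_pi) V = R_pi for V_pi.
    # For gamma < 1 the matrix (I - gamma * T_pi) is invertible.
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
\end{verbatim}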
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Finding policies for MDPs}
\label{sec-finding_mdp_policies}
Given an {\sc mdp}, our goal is typically to find a policy that is
optimal in the sense that it gets as much total reward as possible, in
expectation over the stochastic transitions that the domain makes. We
build on what we have learned about evaluating the goodness of a
policy (Sections~\ref{sec-mdp_finite_horizon}
and~\ref{sec-mdp_infinite_horizon}), and find optimal policies for the
finite horizon case (Section~\ref{sec-mdp_finite_horizon_optimal}),
then the infinite horizon case
(Section~\ref{sec-mdp_infinite_horizon_optimal}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finding optimal finite-horizon policies}
\label{sec-mdp_finite_horizon_optimal}
How can we go about finding an optimal policy for an {\sc mdp}?
\index{policy!optimal policy} We could imagine enumerating all
possible policies and calculating their value functions\index{value
function!calculating value function} as in the previous section and
picking the best one -- but that's too much work!
The first observation to make is that, in a finite-horizon problem,
the best action to take depends on the current state, but also on the
horizon: imagine that you are in a situation where you could reach a
state with reward 5 in one step or a state with reward 100 in two
steps. If you have at least two steps to go, then you'd move toward
the reward 100 state, but if you only have one step left to go, you should
go in the direction that will allow you to gain 5!
One way to find an optimal policy is to compute an optimal {\em action-value function}, $Q$\index{Q-function}\index{optimal
action-value function}. For the finite-horizon case, we define
$Q^h(s, a)$ to be the expected value of
\begin{itemize}
\item starting in state $s$,
\item executing action $a$, and
\item continuing for $h - 1$ more steps executing an optimal policy
for the appropriate horizon on each step.
\end{itemize}
Similar to our definition of $V^h$ for evaluating a policy, we define
the $Q^h$ function recursively according to the horizon. The only
difference is that, on each step with horizon $h$, rather than
selecting an action specified by a given policy, we select the value
of $a$ that will maximize the expected $Q^h$ value of the next state.
\begin{align}
Q^0(s, a) & = 0 \\
Q^1(s, a) & = R(s, a) + 0 \\
% Q^2(s, a) & = R(s, a) + \sum_{s'}T(s, a, s') \max_{a'} R(s', a') \\
Q^2(s, a) & = R(s, a) + \gamma \sum_{s'}T(s, a, s') \max_{a'} Q^1(s', a') \\
\vdots
\nonumber \\
Q^h(s, a) & = R(s, a) + \gamma \sum_{s'}T(s, a, s') \max_{a'} Q^{h - 1}(s', a')
\end{align}
where $(s',a')$ denotes the state/action pair at the next time step. We can solve for the values of $Q^h$ with a simple recursive algorithm
called {\it{finite-horizon value iteration}} that just computes $Q^h$ starting
from horizon 0 and working up to the desired horizon
$H$ (that is, backward in time). Given $Q^h$, an optimal $\pi_h^*$ can be found as follows:
\begin{equation}
\pi_h^*(s) = \arg\max_{a}Q^h(s, a) \;\;,
\end{equation}
which gives the \emph{immediate} best action(s) to take when there are $h$ steps left; then $\pi_{h-1}^*(s)$ gives the best action(s) when there are $(h-1)$ steps left, and so on. In the case where there are multiple best actions, we can typically break ties randomly.
Additionally, it is worth noting that in order for such an optimal policy to be
computed, we assume that the reward function $R(s,a)$ is bounded on the set of
all possible (state, action) pairs. Furthermore, we will assume that the set of
all possible actions is finite.
\question{The optimal value function is unique, but the optimal policy
is not. Think of a situation in which there is more than one
optimal policy.}
\begin{examplebox} {\bf Dynamic programming}
(somewhat counter-intuitively, dynamic programming is neither really
``dynamic'' nor a type of ``programming'' as we typically understand
it) is a technique for designing efficient algorithms. Most methods
for solving MDPs or computing value functions rely on dynamic
programming to be efficient. The {\em principle of dynamic
programming}\index{dynamic programming} is to compute and store
the solutions to simple sub-problems that can be re-used later in
the computation. It is a very important tool in our algorithmic
toolbox.
Let's consider what would happen if we tried to compute $Q^4(s,
a)$ for all $(s, a)$ by directly using the definition:
\begin{itemize}
\item To compute $Q^4(s_i, a_j)$ for any one $(s_i, a_j)$, we would
need to compute $Q^3(s, a)$ for all $(s, a)$
pairs.
\item To compute $Q^3(s_i, a_j)$ for any one $(s_i, a_j)$, we'd need to
compute $Q^2(s, a)$ for all $(s, a)$ pairs.
\item To compute $Q^2(s_i, a_j)$ for any one $(s_i, a_j)$, we'd
need to compute $Q^1(s, a)$ for all $(s, a)$ pairs.
\item Luckily, those are just our $R(s, a)$ values.
\end{itemize}
So, if we have $n$ states and $m$ actions, this is $O((mn)^3)$
work --- that seems like way too much, especially as the horizon
increases! But observe that we really only have $mnh$ values that
need to be computed: $Q^h(s, a)$ for all $h, s, a$. If we start with
$h=1$, compute and store those values, and then reuse the stored
$Q^{h-1}(s, a)$ values to compute the $Q^h(s, a)$ values, we can do
all this computation in time $O(mnh)$, which is much better!
\end{examplebox}
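Putting these pieces together, here is a Python sketch of finite-horizon value iteration
that stores each $Q^h$ and reuses the stored $Q^{h-1}$ values, in the spirit of the
dynamic-programming discussion above. It assumes the dictionary-style \texttt{T},
\texttt{R}, states, and actions sketched earlier (illustrative names only).
\begin{verbatim}
def finite_horizon_value_iteration(T, R, states, actions, H, gamma=1.0):
    Q = [{(s, a): 0.0 for s in states for a in actions}]        # Q^0 = 0
    for h in range(1, H + 1):
        Qh = {}
        for s in states:
            for a in actions:
                expected = sum(p * max(Q[h - 1][(s2, a2)] for a2 in actions)
                               for s2, p in T[s][a].items())
                Qh[(s, a)] = R(s, a) + gamma * expected
        Q.append(Qh)
    # Optimal action with h steps to go: pi*_h(s) = argmax_a Q^h(s, a).
    policy = [None] + [
        {s: max(actions, key=lambda a: Q[h][(s, a)]) for s in states}
        for h in range(1, H + 1)
    ]
    return Q, policy
\end{verbatim}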
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Finding optimal infinite-horizon policies}
\label{sec-mdp_infinite_horizon_optimal}
In contrast to the finite-horizon case, the best way of behaving in an
infinite-horizon discounted {\sc mdp} is not time-dependent. That is,
the decisions you make at time $t = 0$, looking forward to infinity,
will be the same decisions that you make at time $t = T$ for any positive
$T$, also looking forward to infinity.
An important theorem about {\sc mdp}s is: in the infinite-horizon
case, there exists a stationary\index{policy!stationary policy}
\note{Stationary means that it doesn't change over time; in contrast,
the optimal policy in a finite-horizon {\sc mdp} is {\em
non-stationary.}} optimal policy $\pi^*$ (there may be more than
one) such that for all $s \in \mathcal S$ and all other policies
$\pi$, we have
\begin{equation}
V_{\pi^*}(s) \ge V_{\pi}(s) \;\;.
\end{equation}
% Let $n = |\mathcal S|$ andd $m = |\mathcal A|$. Algorithm for finding $\pi^*$:
% \begin{itemize}
% \item
% enumerate and test, complexity is $O(m^n)$
% \item
% linear programming, complexity is $O(\text{poly}(n, m b))$ where $b$ is the number of bits per element of $T, R$
% \item
% policy iteration, complexity is $O(\text{poly}(n, m, \text{bits}(\gamma)))$, requires solving lots of $n \times n$ systems
% \item
% {\underline{value iteration}}, complexity is $O\left(\text{poly}(n, m, b, \frac{1}{1 - \gamma}\right)$
% \end{itemize}
% The latter is easy to implement and the foundation for many reinforcement-learning methods.
There are many methods for finding an optimal policy for an {\sc mdp}.
We have already seen the finite-horizon value iteration case. Here we
will study a very popular and useful method for the infinite-horizon
case, {\em infinite-horizon value iteration}. It is also important to
us, because it is the basis of many {\em reinforcement-learning} methods.
We will again assume that the reward function $R(s,a)$ is bounded on the
set of all possible (state, action) pairs and additionally that the number
of actions in the action space is finite.
Define $Q(s, a)$ to be the expected infinite-horizon discounted
value of being in state $s$, executing action $a$, and executing
an optimal policy $\pi^*$ thereafter. Using similar reasoning to the
recursive definition of $V_\pi,$ we can express this value
recursively as
\begin{equation}
Q(s, a) = R(s, a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q(s',
a') \;\;.
\end{equation}
This is also a set of equations, one for each $(s, a)$ pair. This
time, though, they are not linear (due to the $\max$ operation), and so
they are not easy to solve. But there is a theorem that says they
have a unique solution!
Once we know the optimal action-value function, then we can extract an
optimal policy $\pi^*$ as
\begin{equation}
\pi^*(s) = \arg\max_{a}Q(s, a) \;\;.
\end{equation}
\note{As in the finite-horizon case, there may be more than one optimal
policy in the infinite-horizon case.}
We can iteratively solve for the $Q^*$ values with the
infinite-horizon value iteration algorithm, shown below:
\begin{codebox}
\Procname{$\proc{Infinite-Horizon-Value-Iteration}(\mathcal S, \mathcal A, T, R, \gamma, \epsilon)$}
\li \For $s \in \mathcal{S}, a \in \mathcal{A}:$
\Do
\li $Q_{\text{old}}(s, a) = 0$
\End
\li \While not converged:
\Do
\li \For $s \in \mathcal{S}, a \in \mathcal{A}:$
\Do
\li $Q_{\text{new}}(s, a) = R(s, a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q_{\text{old}}(s', a')$
\End
\li \If $\max_{s, a}\lvert Q_{\text{old}}(s, a) - Q_{\text{new}}(s, a)\rvert < \epsilon:$
\Do
\li return $Q_{\text{new}}$
\End
\li $Q_{\text{old}} = Q_{\text{new}}$
\End
\end{codebox}
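The pseudocode above could be rendered in Python roughly as follows, again assuming the
dictionary-style \texttt{T}, \texttt{R}, states, and actions sketched earlier; this is a
sketch under those assumptions, not a reference implementation.
\begin{verbatim}
def infinite_horizon_value_iteration(T, R, states, actions, gamma, eps):
    Q_old = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                expected = sum(p * max(Q_old[(s2, a2)] for a2 in actions)
                               for s2, p in T[s][a].items())
                Q_new[(s, a)] = R(s, a) + gamma * expected
        if max(abs(Q_old[k] - Q_new[k]) for k in Q_new) < eps:
            return Q_new
        Q_old = Q_new

def greedy_policy(Q, states, actions):
    # Extract a (possibly non-unique) optimal policy: pi(s) = argmax_a Q(s, a).
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
\end{verbatim}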
\paragraph*{Theory}
There are a lot of nice theoretical results about infinite-horizon value iteration.
For some given (not necessarily optimal) $Q$ function, define
$\pi_{Q}(s) = \arg\max_{a}Q(s, a)$.
\begin{itemize}
\item After executing infinite-horizon value
iteration with convergence hyper-parameter $\epsilon$,
\note{Note the new
notation! Given two functions $f$ and $f'$, we
write $\lVert f - f' \rVert_\text{max}$ to mean $\max_x \lvert f(x)
- f'(x)\rvert$. It measures the maximum absolute
disagreement between the two functions at any input $x$.}
\begin{equation}
\lVert V_{\pi_{Q_{\text{new}}}} - V_{\pi^*} \rVert_{\text{max}} < \epsilon \;\; .
\end{equation}
%
\item
There is a value of $\epsilon$ such that
\begin{equation}
\lVert Q_{\text{old}} - Q_{\text{new}} \rVert_{\text{max}} <
\epsilon \Longrightarrow \pi_{Q_{\text{new}}} = \pi^*
\end{equation}
\item As the algorithm executes,
$\lVert V_{\pi_{Q_{\text{new}}}} - V_{\pi^*} \rVert_{\text{max}}$ decreases
monotonically on each iteration.
\item The algorithm can be executed asynchronously, in parallel: as
long as all $(s, a)$ pairs are updated infinitely often in an
infinite run, it still converges to the optimal value.
\note{This is very important for reinforcement learning.}
\end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: