We call ourselves {\it Homo sapiens}\itindex{Homo sapiens}---man the
wise---because our \newtermi{intelligence} is so important to us.
The field of
\newterm{artificial intelligence}, or AI, attempts not just to understand but
also to {\em build} intelligent entities.
Some have defined intelligence and measured success in terms of fidelity
to {\em human} performance, while others prefer an abstract,
formal definition of intelligence, called
\newterm{rationality}---loosely speaking, doing the
``right thing.''
\subsection{Acting humanly: The Turing Test approach}
The \newterm{Turing Test}, proposed by Alan Turing in 1950, requires a
computer to have the following capabilities:
\plainnewtermitem{natural language processing} to enable it to
communicate successfully in English;
\plainnewtermitem{knowledge representation}
to store what it knows or hears;
\plainnewtermitem{automated reasoning} to answer
questions and to draw new conclusions;
\plainnewtermitem{machine learning} to adapt to new
circumstances and to
detect and extrapolate patterns.
However, the so-called
\newterm{total Turing Test} includes
a video signal so that the interrogator can test the subject's
perceptual abilities, as well as the opportunity for the interrogator to
pass physical objects ``through the hatch.''
To pass the total Turing Test,
the computer will need
\plainnewtermitem{computer vision}\index{computer
vision} to perceive objects, and
\plainnewtermitem{robotics} to manipulate objects and move about.
The interdisciplinary field of \newterm{cognitive
science} brings together computer models from AI and
experimental techniques from psychology to construct precise and testable
theories of the human mind.
Aristotle's\nindex{Aristotle} \newterm[syllogism]{syllogisms} provided patterns for argument
structures that always yielded correct conclusions when given correct
premises---for example, ``Socrates is a man; all men are
mortal; therefore, Socrates is mortal.''
These laws of thought were supposed to govern the
operation of the mind; their study initiated the field called
\newterm{logic}.
The so-called
\newterm{logicist} tradition within artificial intelligence
hopes to build on such programs to create intelligent systems.
The theory of
\newterm{probability} fills this gap, allowing rigorous
reasoning with uncertain information.
\subsection{Acting rationally: The rational agent approach}
An \newterm{agent} is just something that acts ({\it
agent} comes from the Latin {\it agere}, to do).
A \newterm{rational agent} is one that acts so as
to achieve the best outcome or, when there is uncertainty, the best
expected outcome.
In other words, we define artificial intelligence as the study and
construction of agents that \newtermi{do the right thing}.
Chapters \ref{game-playing-chapter}
and~\ref{complex-decisions-chapter} deal with the issue of \newtermi{limited
rationality}---acting appropriately when there is
not enough time to do all the computations one might like.
Ultimately, we want
agents that are \newterm{provably beneficial}\ntindex{artificial
intelligence!provably beneficial} to humans.
Like Aristotle\nindex{Aristotle} and Leibniz, Descartes was a strong advocate of
the power of reasoning in understanding the world, a philosophy now
called \newtermi{rationalism}.
But Descartes was also a proponent of
\newterm{dualism}.
An alternative to
dualism is \termi{materialism} or \newtermi{naturalism}, which holds that
the brain's operation according to the laws of physics {\em
constitutes} the mind.
The
\newterm{empiricism} movement, starting with Francis
Bacon's\nindex{Bacon, F.} (1561--1626) {\it Novum Organum},\footnote{The
{\em Novum Organum} is an update of Aristotle's\nindex{Aristotle} {\em
Organon}, or instrument of thought.}
David Hume's
(1711--1776) {\it A Treatise of Human Nature}~\cite{Hume:1739} proposed
what is now known as the principle of
\newterm{induction}: that general rules are acquired
by exposure to repeated associations between their elements.
Influenced by Bertrand Russell\nindex{Russell, B.} (1872--1970), the famous Vienna
Circle \cite{Sigmund:2017}, a group of philosophers and mathematicians
meeting in Vienna in the 1920s and 1930s, developed the doctrine of
\newterm{logical positivism}\ntindex{logical
positivism}.
This doctrine holds that all
knowledge can be characterized by logical theories connected, ultimately,
to \newterm{observation sentences}\tindex{observation sentences} that
correspond to sensory inputs; thus logical positivism combines
rationalism and empiricism.
The \newterm{confirmation
theory} of Rudolf Carnap\nindex{Carnap, R.}
(1891--1970) and Carl Hempel\nindex{Hempel, C.} (1905--1997) attempted to
analyze the acquisition of knowledge from experience by quantifying the
degree of belief that should be assigned to logical sentences based on
their connection to observations that confirm or disconfirm them.
Jeremy Bentham~\citeyear{Bentham:1823} and
John Stuart Mill~\citeyear{Mill:1863} promoted the idea of
\newtermi{utilitarianism}: that rational decision making should apply to
all spheres of human activity.
The idea of \newtermi{formal logic} can be traced back to the
philosophers of ancient Greece, but its mathematical development
really began with the work of George Boole\nindex{Boole, G.}
(1815--1864), who worked out the details of propositional, or Boolean,
logic~\cite{Boole:1847}.
The theory of \newtermi{probability} can be seen as generalizing logic to
situations with uncertain information---a consideration of critical importance
for AI.
The formalization of probability, combined with the availability of data, led
to the emergence of \newtermi{statistics} as a field.
The history of computation is as old as the history of numbers, but the first
nontrivial \newterm{algorithm} is thought to be Euclid's
algorithm for computing greatest common divisors.
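For illustration only (a standard modern rendering in Python, not the book's presentation), Euclid's algorithm fits in a few lines:
\begin{verbatim}
# Euclid's algorithm: repeatedly replace (a, b) by (b, a mod b)
# until the remainder is zero; the survivor is the GCD.
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(48, 36))   # -> 12
\end{verbatim}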
Kurt G\"{o}del's \newterm{incompleteness
theorem}\tindex{incompleteness!theorem}\tindex{theorem!incompleteness} showed
that in any formal theory as strong as Peano arithmetic (the elementary theory
of natural numbers), there are true statements that are
undecidable in the sense that they have no proof within
the theory.
This motivated Alan Turing\nindex{Turing, A.} (1912--1954) to try to
characterize exactly which functions {\em are} \newterm{computable}---capable
of being computed by an effective procedure.
Although decidability and computability are important to an understanding of
computation, the notion of \newterm{tractability}\ntindex{tractability of
inference} has had an even greater impact.
The theory of \newterm{NP-completeness}\tindex{NP-complete},
pioneered by Stephen Cook~\citeyear{Cook:1971} and Richard
Karp~\citeyear{Karp:1972}, provides a basis for analyzing the tractability of
problems: any problem class to which the class of NP-complete problems can be
reduced is likely to be intractable.
Daniel Bernoulli proposed instead a principle based on maximization of
expected \newtermi{utility}, an internal, subjective quantity, and explained
human investment choices by proposing that the marginal utility of an
additional quantity of money diminished as one acquired more money.
\newterm{Decision theory}, which combines probability
theory with utility theory, provides a formal and complete framework for
decisions (economic or otherwise) made under uncertainty---that is, in cases
where probabilistic descriptions appropriately capture the decision maker's
environment.
Von Neumann and Morgenstern's development of
\newterm{game theory}\tindex{game theory}~\cite<see also>{Luce+Raiffa:1957}
included the surprising result that, for some games, a rational agent should
adopt policies that are (or at least appear to be) randomized.
This topic was pursued in the field of
\newterm{operations research}, which emerged in World War
II from efforts in Britain to optimize radar
installations, and later found civilian applications in complex
management decisions.
Herbert Simon\nindex{Simon, H.} (1916--2001) won the Nobel Prize in economics in
1978 for his early work showing that models based on
\newtermi{satisficing}---making decisions that are ``good enough,''
rather than laboriously calculating an optimal decision---gave a better
description of actual human behavior \cite{Simon:1947}.
\newterm{Neuroscience} is the study of the
nervous system, particularly the brain.
By that time, it was known that the brain consisted largely of nerve cells,
or \newterm[neuron]{neurons}, but it was not until 1873 that
Camillo Golgi (1843--1926)\nindex{Golgi, C.} developed a staining
technique allowing the observation of individual neurons in the brain
(see \figref{neuron-figure}).
These are augmented by advances in
single-cell electrical recording of neuron activity and by the methods
of \newtermi{optogenetics}~\cite{Crick:1999,Zemelman+al:2002,Han+Boyden:2007}, which allow both measurement and control of individual neurons
modified to be light-sensitive.
The development of
\newterm[brain--machine interface]{brain--machine
interfaces}\ntindex{brain--machine
interface}~\cite{Lebedev+Nicolelis:2006} for both sensing and motor
control not only promises to restore function to disabled individuals
but also sheds light on many aspects of neural systems.
Futurists make much of these numbers, pointing to an
approaching \newtermi{singularity} at which computers reach a superhuman
level of
performance~\cite{Vinge:1993,Kurzweil:2005,Doctorow+Stross:2012}, and
then rapidly improve themselves even further.
Applying this viewpoint to humans,
the \newterm{behaviorism} movement, led by John
Watson\nindex{Watson, J.}
\newterm{Cognitive psychology}, which
views the brain as an information-processing device, can be traced
back at least to the works of William James\nindex{James,
W.} (1842--1910).
The central figure in the creation of what is now called
\newterm{control theory} was
Norbert Wiener\nindex{Wiener, N.} (1894--1964).
Wiener's book {\it
Cybernetics}~\citeyear{Wiener:1948}\newterm[cybernetics]{}
became a bestseller and awoke the public to the possibility of
artificially intelligent machines.
Ashby's {\em Design for a
Brain\/} \citeyear{Ashby:1948,Ashby:1952} elaborated on his idea that
intelligence could be created by the use of
\newtermi{homeostatic} devices containing
appropriate feedback loops to achieve stable adaptive behavior.
Modern control theory, especially the branch known as stochastic
optimal control, has as its goal the design of systems that maximize
an \newtermi{objective function} over time.
Modern linguistics and AI, then, were ``born'' at about the same time, and grew
up together, intersecting in a hybrid field called \newtermi{computational
linguistics} or \term{natural language processing}\tindex{language!processing}.
His rule, now called \newtermi{Hebbian
learning}, remains an influential model to this day.
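As a hedged illustration (the symbols \(\Delta w_{ij}\), \(a_i\), \(a_j\), and \(\eta\) are notation assumed here, and exact formulations vary), the basic Hebbian update strengthens a connection in proportion to the product of the two units' activities:
\[ \Delta w_{ij} = \eta \, a_i \, a_j , \]
where \(w_{ij}\) is the weight of the connection between units \(i\) and \(j\), \(a_i\) and \(a_j\) are their activation levels, and \(\eta\) is a small learning-rate constant.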
The success of \nosystem{GPS}
and subsequent programs as models of cognition led
\citeA{Newell+Simon:1976} to formulate the famous
\newterm{physical symbol system} hypothesis, which states that ``a
physical symbol system has the necessary and sufficient means for
general intelligent action.''
In 1958, in MIT AI Lab Memo No.~1, John McCarthy defined the high-level
language \newterm{Lisp}, which was to become the dominant AI
programming language for
the next 30 years.
These limited domains
became known as \newterm[microworld]{microworlds}.
Early experiments in \newterm{machine
evolution} (now called
\newtermi{genetic programming}) \cite{Friedberg:1958,Friedberg+al:1959} were
based on the undoubtedly correct belief that by making an appropriate series of
small mutations to a machine-code program, one can generate a
program with good performance for any particular task.
Such approaches have been called \newterm[weak method]{weak
methods} because, although general, they do not scale up
to large or difficult problem instances.
In 1971, Feigenbaum and others at
Stanford began the Heuristic Programming
Project (HPP) to investigate the
extent to which the new methodology of \newterm{expert
systems} could be applied to other
areas of human expertise.
\system{Mycin} incorporated a calculus of uncertainty called
\newterm[certainty factor]{certainty factors} (see
\chapref{bayes-nets-chapter}), which seemed (at the time) to fit well
with how doctors assessed the impact of evidence on the diagnosis.
Others, following Minsky's idea
of \newterm{frames}~\citeyear{Minsky:1975},
adopted a more structured approach, assembling facts about particular object
and event types and arranging the types into a large taxonomic
hierarchy analogous to a biological taxonomy.
\subsection{The return of neural networks (1986--present)}
In the mid-1980s at least four different groups reinvented the
\newtermi{back-propagation} learning algorithm first found in 1969 by Bryson
and Ho.
These so-called \newterm{connectionist} models were seen
by some as direct competitors both to the symbolic models promoted by Newell
and Simon and to the logicist approach of McCarthy and others.
In the 1980s, approaches using
\newterm{hidden Markov models} (HMMs) came to
dominate the area.
Pearl's development of \newterm[Bayesian network]{Bayesian
networks}\tindex{Bayesian network} yielded a rigorous and efficient formalism
for representing uncertain knowledge as well as practical algorithms for
probabilistic reasoning.
The term \newtermi{deep learning} refers to machine learning using networks
with multiple layers.
They also produce an \newtermi{AI Index} at {\tt aiindex.org}
to help track progress.
{\bf Image understanding}: Not content with exceeding human accuracy on
the challenging \system{ImageNet} object recognition task, computer vision
researchers have taken on the more difficult problem of \newtermi{image
captioning}.
At present,
the research community and the major corporations involved in AI research have
developed voluntary self-governance principles for AI-related activities,
including the \newterm{Asilomar AI Principles} by the Future of Life
Institute~\citeyear{FLI:2017} and the tenets of the Partnership on
AI~\citeyear{PAI:2017}.
They called the
effort \newtermi{human-level AI} or HLAI; their first symposium was in 2004
\cite{Minsky+al:2004}.
Another effort with similar goals, the so-called AGI or
\newtermi{Artificial General Intelligence}
movement~\cite{Goertzel+Pennachin:2007}, held its first conference and
organized the {\em Journal of Artificial General Intelligence} in 2008.
We might call this the
\newtermi{gorilla problem}: about seven million years ago, a now-extinct
primate evolved, with one branch leading to gorillas and one to humans.
To pick on one
example, we might call this the \newtermi{King Midas problem}:
Midas, a legendary king in Greek mythology, asked that everything he
touched should turn to gold, but then regretted it when his food and
drink turned to gold and he died of starvation.
In modern terminology,
we call this the problem of \newtermi{value alignment}: the values or
objectives put into the machine must be aligned with those of the
human.
In \chapref{reinforcement-learning-chapter},
we explain the methods of \newtermi{inverse reinforcement learning}
that allow machines to learn more about human preferences from
observations of the choices that humans make.
An \termi{agent} is anything that can be viewed as
perceiving its
\newterm{environment} through
\newterm[sensor]{sensors} and acting upon
that environment through
\newterm[actuator]{actuators}.
We use the term
\newtermi{percept} to refer to the agent's perceptual
inputs at any given instant.
An agent's \newtermi{percept sequence}
is the complete history of everything the agent has ever perceived.
Mathematically speaking, we say that an agent's behavior
is described by the \newterm{agent function} that maps
any given percept sequence to an action.
{\em Internally}, the agent function
for an artificial agent will be implemented by an
\newterm{agent program}.
\section{Good Behavior: The Concept of Rationality}
A \newterm{rational agent} is one that does
the right thing---conceptually speaking, every entry in the table for
the agent function is filled out correctly.
Moral philosophy has developed several
different notions of the ``right thing,'' but AI has generally stuck
to one notion called \newtermi{consequentialism}: we evaluate an
agent's behavior by its consequences.
This notion of desirability is captured by a \newterm{performance
measure} that evaluates any given
sequence of environment states.
This leads to a \newterm{definition of a rational agent}\tindex{agent!rational}:
For each possible percept sequence, a rational agent should select an
action that is expected to maximize its performance measure,
given the evidence provided by the percept sequence and
whatever built-in knowledge the agent has.
Consider the simple vacuum-cleaner agent that cleans a square if it is
dirty and moves to the other square if not; this is the agent function
tabulated in \tabref{vacuum-agent-function-table}.
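As an illustrative sketch in Python (not the book's code; the function name and the two-location world with squares A and B are assumptions), the tabulated agent function reduces to a few condition--action tests:
\begin{verbatim}
# Minimal sketch of the two-square vacuum world's reflex agent.
# A percept is a (location, status) pair, e.g. ('A', 'Dirty').
def reflex_vacuum_agent(percept):
    location, status = percept
    if status == 'Dirty':
        return 'Suck'      # clean the current square
    elif location == 'A':
        return 'Right'     # otherwise move to the other square
    else:
        return 'Left'

print(reflex_vacuum_agent(('A', 'Dirty')))   # -> Suck
print(reflex_vacuum_agent(('B', 'Clean')))   # -> Left
\end{verbatim}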
If the geography of the environment is unknown, the agent will
need to \newterm[exploration]{explore} it.
\subsection{Omniscience, learning, and autonomy}
We need to be careful to distinguish between rationality and
\newtermi{omniscience}.
Doing actions {\em in order to modify future
percepts}---sometimes called
\newterm{information gathering}---is an
important part of rationality and is covered in depth in
\chapref{decision-theory-chapter}.
Our definition requires a rational agent not only to gather
information but also to \newterm[learning]{learn}
as much as possible from what it perceives.
To the extent that an agent relies on the prior knowledge of its
designer rather than on its own percepts, we say that the agent lacks
\newterm{autonomy}.
A rational agent should be autonomous---it should learn what it can to
compensate for partial or incorrect prior knowledge.
First, however, we must think
about \newterm[task environment]{task environments}, which are
essentially the ``problems'' to which rational agents are the
``solutions.''
For the acronymically minded, we call this the
\newterm{PEAS} ({\bf P}erformance, {\bf E}nvironment, {\bf A}ctuators,
{\bf S}ensors) description.
Note that virtual
task environments can be just as complex as the ``real'' world:
for example, a \newterm[software agent]{software agent} (or software
robot or \newterm[softbot]{softbot}) that trades
on auction and reselling websites deals with millions of other users
and billions of objects, many with real images.
\newterm{Fully observable}
vs.~\newterm{partially observable}\ntindex{environment!partially
observable}: If an agent's sensors give it access to the complete
state of the environment at each point in time, then we say that the
task environment is fully observable.
If the agent has no sensors at all then the environment is
\newterm{unobservable}.
\newterm{Single agent} vs.~\newterm{multiagent}:
The distinction between single-agent and multiagent environments may
seem simple enough.
Thus, chess is a
\newterm{competitive}
multiagent environment.
In the taxi-driving environment, on the other
hand, avoiding collisions maximizes the performance measure of all
agents, so it is a partially \newterm{cooperative} multiagent environment.
\newterm{Deterministic}\index{deterministic
environment}
vs.~\newterm{stochastic}.
One final note: our use of the word
``stochastic'' generally implies that uncertainty about outcomes is
quantified in terms of probabilities; a
\newterm{nondeterministic} environment is one in
which actions are characterized by their {\em possible} outcomes, but
no probabilities are attached to them.
\newterm{Episodic}
vs.~\newterm{sequential}: In an episodic task environment, the agent's experience
is divided into atomic episodes.
\newterm{Static} vs.~\newterm{dynamic}:
If the environment can change while an agent is deliberating, then we
say the environment is dynamic for that agent; otherwise, it is
static.
If
the environment itself does not change with the passage of time but
the agent's performance score does, then we say the environment is
\newterm{semidynamic}.
\newterm{Discrete} vs.~\newterm{continuous}:
The discrete/continuous distinction applies to the {\em state}
of the environment, to the way {\em time} is handled, and to the {\em
percepts} and {\em actions} of the agent.
\newterm{Known} vs.~\newterm{unknown}:
Strictly speaking, this distinction refers not to the environment
itself but to the agent's (or designer's) state of knowledge about the
``laws of physics'' of the environment.
Such
experiments are often carried out not for a single environment but
for many environments drawn from an \newterm{environment
class}.
For this reason, the
code repository also includes an \newterm{environment generator} for
each environment class.
The job of AI is to design an \newterm{agent
program} that implements the agent function---the
mapping from percepts to actions.
We assume this program will run on some
sort of computing device with physical sensors and actuators---we call
this the \newterm{architecture}:
\mbox{\em agent} = \mbox{\em architecture} + \mbox{\em program}\ .
\subsection{Simple reflex agents}
The simplest kind of agent is the \newterm{simple reflex
agent}.
We call such a connection a
\newterm{condition--action
rule},\footnote{Also called
\term{situation--action rules}\tindex{rule!situation--action},
\term{productions}\tindex{production}, or \term{if--then rules}\tindex{rule!if--then}.}
Escape from infinite loops is possible if the agent can
\newterm[randomization]{randomize} its actions.
That is, the agent should maintain some sort of
\newterm{internal state} that depends on the
percept history and thereby reflects at least some of the unobserved
aspects of the current state.
An agent that
uses such models is called a \newterm{model-based
agent}.
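A minimal sketch of the idea, under assumptions (the {\tt update\_state} helper and the rule representation are illustrative, not the book's code): the agent keeps an internal state, updates it from the last action, its model of the world, and the latest percept, and only then selects an action.
\begin{verbatim}
# Sketch of a model-based reflex agent: internal state is refreshed from
# the last action and the new percept before a condition--action rule fires.
def model_based_reflex_agent(update_state, rules, initial_belief=None):
    memory = {'belief': initial_belief, 'last_action': None}
    def program(percept):
        memory['belief'] = update_state(memory['belief'],
                                        memory['last_action'], percept)
        for condition, action in rules:
            if condition(memory['belief']):
                memory['last_action'] = action
                return action
        memory['last_action'] = None
        return None
    return program
\end{verbatim}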
In other words, as well
as a current state description, the agent needs some sort of
\newterm{goal} information that describes situations
that are desirable---for example, being at a particular
destination.
Because
``happy'' does not sound very scientific, economists and computer
scientists use the term \newterm{utility}
instead.\footnote{The word ``utility'' here refers to ``the quality of
being useful,'' not to the electric company or waterworks.}
An agent's \newterm{utility
function} is essentially an internalization
of the performance measure.
Technically speaking, a rational utility-based agent chooses the
action that maximizes the \newterm{expected
utility} of the action outcomes---that is, the
utility the agent expects to derive, on average, given the
probabilities and utilities of each outcome.
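As a hedged formalization (the notation here is assumed, not quoted from the text), choosing the action with maximum expected utility in state \(s\) can be written as
\[ a^{*} = \operatorname*{arg\,max}_{a} \sum_{s'} P(s' \mid s, a)\, U(s'), \]
where \(P(s' \mid s, a)\) is the probability that action \(a\) taken in state \(s\) leads to outcome state \(s'\) and \(U\) is the agent's utility function.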
The most important distinction is between the
\newterm{learning element}, which is responsible for making improvements, and
the \newterm{performance element}, which
is responsible for selecting external actions.
The learning element
uses feedback from the \newterm{critic}
on how the agent is doing and determines how the performance element
should be modified to do better in the future.
The last component of the learning agent is the \newterm{problem
generator}.
In an \newterm{atomic representation}\ntindex{atomic
representation} each state of the world is
indivisible---it has no internal structure.
A
\newterm{factored representation}\ntindex{factored
representation} splits up each
state into a fixed set of
\newterm[variable]{variables} or \newterm[attribute]{attributes},
each of which can have a \newtermi{value}.
Instead, we would need a
\newterm{structured representation}\ntindex{structured
representation}, in which objects
such as cows and trucks and their various and varying relationships can be described
explicitly.
As we mentioned earlier, the axis along which atomic, factored, and structured
representations lie is the axis of increasing
\newterm{expressiveness}.
If there is a one-to-one mapping between concepts and memory locations,
we call that a \newtermi{localist representation}.
But if the representation of a concept is spread over many memory locations,
and each memory location is employed as part of the representation of multiple
different concepts, we call that a \newtermi{distributed representation} \cite{Hinton:1986}.
The concept of a
\newterm{controller} in control
theory is identical to that of an agent in AI.
It has also
infiltrated the area of operating systems, where \newtermi{autonomic
computing} refers to computer systems and networks that monitor and
control themselves with a perceive--act loop and machine learning
methods~\cite{Kephart+Chess:2003}.
Noting that a collection of agent programs designed to work well together in a true multiagent environment
necessarily exhibits modularity---the programs share no internal state and communicate with each other only through the environment---it is common within the field of
\newtermi{multiagent systems} to design the agent program of a single agent
as a collection of autonomous sub-agents.
Such an agent is
called a \newterm{problem-solving agent}, and
the process of looking for a path to a goal is called \newterm{search}.
With that information, the agent can follow
this four-phase problem-solving process:
\item \newterm{Goal formulation}: The
agent adopts the \term{goal}\tindex{goal} of reaching
Bucharest.
\item \newterm{Problem formulation}: The
agent devises a \term{model}\tindex{model} of the world that
describes the states and actions necessary to reach the goal.
\item \newterm{Search}: Before taking any action in the real world,
the agent simulates sequences of actions in its model, searching
until it finds a sequence of actions that reaches the goal.
Such a
sequence is called a \newterm{solution}.
\item \newterm{Execution}: The agent can now execute the actions
in the solution, one at a time.
Control theorists call this an
\newterm{open-loop} system: ignoring the percepts breaks the
loop between agent and environment.
If there is a chance that the model is
incorrect, or the environment is stochastic, then the agent would be safer
using a \newterm{closed-loop} approach that monitors the
percepts (see \secref{partially-observable-search-section}).
\subsection{Well-defined search problems and solutions}
A search \newterm{problem} can be defined formally as follows:
\item A set of possible \newterm{states} that the environment can be in.
We call this the \newtermi{state space}.
\item The \newterm{initial state} that the agent starts
in.
\item A set of one or more \newtermi{goal states}.
\item The \newterm{actions} available to the agent.
We say that each of these actions is
\newtermi{applicable} in \(s\).
\item A \newtermi{transition model}, which describes what each action does.
\item An \newtermi{action cost function}, \(c(s,a,s')\)\index{C@$c$ (action
cost)} that denotes the numeric cost of applying action \(a\) in state \(s\) to
reach state \(s'\).
A sequence of actions forms a
\newterm{path}, and a \term{solution}\tindex{solution} is a path
from the initial state to a goal state.
An \newtermi{optimal solution} has the lowest path cost among all
solutions.
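As an illustrative Python sketch (class and method names are assumptions, loosely in the spirit of the book's companion code rather than its pseudocode), the components above map directly onto a small problem class:
\begin{verbatim}
# Sketch of a search problem: initial state, goal states, applicable
# actions, transition model, and an action cost (1 per action by default).
class Problem:
    def __init__(self, initial, goals):
        self.initial = initial
        self.goals = set(goals)

    def actions(self, state):
        """Actions applicable in `state`."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state reached by doing `action` in `state`."""
        raise NotImplementedError

    def action_cost(self, s, action, s_prime):
        """Numeric cost c(s, a, s') of applying `action` in `s`."""
        return 1

    def is_goal(self, state):
        return state in self.goals
\end{verbatim}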
The state space can be represented as a \newtermi{graph} in which the vertices
are states and the directed edges between them are actions.
The process of removing detail from a
representation is called \newterm{abstraction}.
Can we be more precise about the appropriate \newterm{level of abstraction}?
A \newtermi{standardized
problem}, such as the Rubik's Cube, is intended to
illustrate or exercise various problem-solving methods.
A
\newterm{real-world problem}, such as robot
navigation, is one whose solutions people actually use, and whose formulation
is idiosyncratic, not standardized, because, for example, each robot has
different sensors that produce different data.
\subsection{Standardized problems}
A \newterm{grid world} problem is a two-dimensional rectangular array of square
cells in which agents can move from cell to cell.
Another type of grid world is the \newterm{sokoban} puzzle, in which the
agent's goal is to push a number of boxes, scattered about the grid, to
designated storage locations.
In \newterm{sliding-block puzzles}, a number of
tiles (sometimes called blocks or pieces) are arranged in a grid with a number of blank spaces so
that some of the tiles can slide into the blank space.
One well-known variant
is the \newterm{8-puzzle} (see
\figref{8puzzle-figure}), which consists of a 3\(\times\)3 grid with eight
numbered tiles and one blank space.
\newterm[Touring problem]{Touring problems},
such as ``Visit every city in
\figref{romania-distances-figure}, starting and ending
in Bucharest'' are similar to route-finding problems, but each state must include not just the current location but
also the {\em set of cities the agent has visited}.
The \newterm{traveling salesperson problem}\ntindex{traveling
salesperson problem}
(TSP) is a touring
problem in which each city must be visited exactly once.
A \newterm{VLSI layout} problem
requires positioning millions of components and connections on a chip
to minimize area, minimize circuit delays, minimize stray capacitances,
and maximize manufacturing yield.
\newterm{Robot navigation}\index{problem!robot
navigation} is a generalization of the route-finding problem described earlier.
\newterm{Automatic assembly sequencing} of
complex objects (such as electric motors) by a robot has been standard industry
practice since the 1970s.
One important assembly problem is \newtermi{protein design}, in which
the goal is to find a sequence of amino acids that will fold into a
three-dimensional protein with the right properties to cure some disease.
\section{Search Algorithms}
A \newtermi{search algorithm} takes a search problem as input and returns a
solution, or an indication of failure.
In this chapter we consider algorithms
that superimpose a \newterm{search tree} over the
state-space graph, forming various paths from the initial state, trying to extend a path to reach a goal state.
We can \newterm{expand} the node,
thereby \newterm{generating} a set of three new
\newtermi{successor nodes} in the search tree.
That is, for each applicable
action, we get a resulting state ({\it Sibiu, Timisoara,} and {\it Zerind}),
for which we create a new \newterm{child node} in the
search tree, where each node has {\it Arad} as its \newtermi{parent node}.
We call this the \newtermi{frontier} of the search
tree.
We say that any state that has had a node generated for it has been
\newtermi{reached} (whether or not that node has been expanded).\footnote{Some
authors call the frontier the \termi{open list},
which is both geographically less evocative and computationally less
appropriate, because a queue is more efficient than a list here.}
Note that the frontier \newterm[separator]{separates} two regions of the
state-space graph: an interior region where every state has been expanded, and
an exterior region of states that have not yet been reached.
A very general
approach is called \newtermi{best-first search}, in which we choose the node,
\(n\), with the minimum value of some \newtermi{evaluation function}, \(f(n)\).
A \newtermi{node} in the tree is represented by a data structure with four components:
\item \var{node}.\prog{State}: the state to which the node corresponds;
\item \var{node}.\prog{Parent}: the node in the tree that generated this node;
\item \var{node}.\prog{Action}: the action that was applied to the parent's state to generate this node;
\item \var{node}.\prog{Path-Cost}: the total cost of the path from the initial
state to this node.
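A minimal Python sketch of that node structure (the dataclass form and field names are assumptions, mirroring the four components just listed):
\begin{verbatim}
# Search-tree node with the four components described above.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    state: Any                       # state this node corresponds to
    parent: Optional['Node'] = None  # node that generated this one
    action: Any = None               # action applied to the parent's state
    path_cost: float = 0.0           # total path cost from the initial state
\end{verbatim}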
The appropriate
choice is a \newtermi{queue} of some kind, because the operations on a frontier
are:
\item \prog{Is-Empty}(\var{frontier}) returns true only if there are no nodes in the frontier.
Three kinds of queues are used in search algorithms:
\item A \newtermi{priority queue} pops first
the node with the minimum cost according to some evaluation function, \(f\).
\item A \newtermi{FIFO queue} or first-in-first-out queue
pops first the node that was added to the queue first; we shall see it is used
in breadth-first search.
\item A \newterm{LIFO queue} or last-in-first-out queue (also known as a \termi{stack}) pops first the most recently added node; we shall see it is
used in depth-first search.
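Putting the pieces together, here is a hedged Python sketch of the best-first search strategy described above, using a priority queue; it builds on the {\tt Problem} and {\tt Node} sketches and keeps a {\tt reached} table to prune redundant paths. The helper {\tt expand} is an assumption introduced here.
\begin{verbatim}
# Sketch of best-first search: always pop the frontier node with the
# minimum f(n).  heapq serves as the priority queue; a counter breaks ties.
import heapq
import itertools

def expand(problem, node):
    """Generate the child nodes of `node`."""
    for action in problem.actions(node.state):
        s2 = problem.result(node.state, action)
        cost = node.path_cost + problem.action_cost(node.state, action, s2)
        yield Node(s2, node, action, cost)

def best_first_search(problem, f):
    node = Node(problem.initial)
    tie = itertools.count()
    frontier = [(f(node), next(tie), node)]
    reached = {problem.initial: node}
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if problem.is_goal(node.state):          # late goal test
            return node
        for child in expand(problem, node):
            s = child.state
            if s not in reached or child.path_cost < reached[s].path_cost:
                reached[s] = child
                heapq.heappush(frontier, (f(child), next(tie), child))
    return None                                  # failure
\end{verbatim}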
We say that {\it Arad} is a \newterm{repeated
state} in the search tree, generated in this case by a
\newtermi{cycle} (also known as a \newterm{loopy path}).
A cycle is a special case of a \newterm{redundant path}.
We call a search
algorithm a \newtermi{graph search} if it checks
for redundant paths and a \newtermi{tree-like search}\footnote{We say
``tree-like search'' rather than ``tree search'' because the state space is still
the same graph no matter how we search it; we are just choosing to treat it as
if it were a tree.} if it does not check.
We can evaluate an algorithm's
performance in four ways:
\newtermitem{Completeness}\ntindex{completeness!search algorithm@of a search
algorithm} Is the algorithm guaranteed to find a solution when there is one,
and to correctly report failure when there is not?
\newtermitem{Cost optimality} Does
it find a solution with the lowest path cost of all solutions?\footnote{Some
authors use the term ``admissibility'' for the property of finding the
lowest-cost solution, and some use just ``optimality,'' but that can be
confused with optimal efficiency.}
\newtermitem{Time complexity} How long does it take to
find a solution?
\newtermitem{Space complexity} How much memory is
needed to perform the search?
To be complete, a search algorithm must be
\newtermi{systematic} in the way it explores an infinite state space, making
sure it can eventually reach any state that is connected to the initial
state.
For an implicit state space, complexity can be measured in
terms of \(d\), the \newterm{depth} or number of actions in an optimal
solution; \(m\), the maximum number of actions in any path; and \(b\), the
\newterm{branching factor} or number of successors of
a node that need to be considered (after eliminating redundant paths).
\subsection{Breadth-first search}
When all actions have the same cost, an appropriate strategy is
\newtermi{breadth-first search}, in which the root node is expanded first, then
all the successors of the root node are expanded next, then {\em their}
successors, and so on.
That also means we can do an \newtermi{early goal test}, checking
whether a node is a solution as soon as it is {\em generated}, rather than the
\termi{late goal test} that best-first search uses, waiting until a node is
popped off the queue.
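A hedged sketch of breadth-first search with a FIFO frontier and the early goal test just described, reusing the {\tt Node} and {\tt expand} sketches from above:
\begin{verbatim}
# Sketch of breadth-first search: FIFO queue, goal test applied to each
# child as soon as it is generated (early goal test).
from collections import deque

def breadth_first_search(problem):
    node = Node(problem.initial)
    if problem.is_goal(node.state):
        return node
    frontier = deque([node])                     # FIFO queue
    reached = {problem.initial}
    while frontier:
        node = frontier.popleft()
        for child in expand(problem, node):
            if problem.is_goal(child.state):     # early goal test
                return child
            if child.state not in reached:
                reached.add(child.state)
                frontier.append(child)
    return None                                  # failure
\end{verbatim}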
This is called
Dijkstra's algorithm by the theoretical computer science community, and
\newtermi{uniform-cost search} by the AI community.
\subsection{Depth-first search and the problem of memory}
\newterm{Depth-first search} always expands the
{\em deepest} node in the frontier first.
A variant of depth-first search called \newterm{backtracking
search}\nindex{backtracking search}\nindex{search,
backtracking} uses even less memory.
\subsection{Depth-limited and iterative deepening search}
To keep depth-first search from wandering down an infinite path, we can use
\newterm{depth-limited search}, a version of
depth-first search in which we supply a depth limit, $\ell$, and treat all
nodes at depth $\ell$ as if they had no successors (see
\figref{recursive-dls-algorithm}).
This number, known as the \newterm{diameter}\ntindex{diameter (of a
graph)} of the state space graph, gives us a better depth
limit, which leads to a more efficient depth-limited search.
\newterm{Iterative deepening search} solves
the problem of picking a good value for $\ell$ by trying all values: first 0,
then 1, then 2, and so on---until either a solution is found, or the
depth-limited search returns the \var{failure} value rather than the
\var{cutoff} value.
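An illustrative sketch of depth-limited and iterative deepening search, under the same assumptions as the sketches above; the {\tt depth} helper and the string {\tt 'cutoff'} standing in for the cutoff value are assumptions.
\begin{verbatim}
# Sketch: depth-limited search treats nodes at the depth limit as leaves and
# reports 'cutoff'; iterative deepening retries with limits 0, 1, 2, ...
def depth(node):
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d

def depth_limited_search(problem, limit):
    frontier = [Node(problem.initial)]   # LIFO stack (depth-first order)
    result = None                        # None stands for failure
    while frontier:
        node = frontier.pop()
        if problem.is_goal(node.state):
            return node
        if depth(node) >= limit:
            result = 'cutoff'            # something was pruned by the limit
        else:
            frontier.extend(expand(problem, node))
    return result

def iterative_deepening_search(problem):
    limit = 0
    while True:
        result = depth_limited_search(problem, limit)
        if result != 'cutoff':
            return result                # a solution node, or None (failure)
        limit += 1
\end{verbatim}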
An alternative approach called
\newtermi{bidirectional search} simultaneously
searches forward from the start state and backwards from the goal state(s),
keeping track of two frontiers and two tables of {\em reached} states.
\section{Informed (Heuristic) Search Strategies}
This section shows how an \newterm{informed
search} strategy---one that
uses domain-specific hints about the location of goals---can find solutions
more efficiently than an uninformed strategy.
The hints come in the form of a
\newtermi{heuristic function}, denoted
\(h(n)\):\footnote{It may seem odd that the
heuristic function operates on a node, when all it really needs is the node's
state.}
\subsection{Greedy best-first search}
\newterm{Greedy best-first search} is a form
of best-first search that expands first the node with the lowest \(h(n)\)
value---the node that appears to be closest to the goal---on the grounds that
this is likely to lead to a solution quickly.
Let us see how this works for route-finding problems in Romania; we use the
\newterm{straight-line distance}\ntindex{straight-line
distance} heuristic, which we will call
\(h_{\J{SLD}}\).
\subsection{A{\star} search}
The most common informed search algorithm is \newterm{A{\star}
search} (pronounced ``A-star search''), a best-first search that uses
the evaluation function
\[ f(n) = g(n) + h(n), \]
where \(g(n)\) is the path cost from the initial state to node \(n\), and
\(h(n)\) is the estimated cost of the cheapest path from \(n\) to a goal state.
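As a hedged illustration reusing the {\tt best\_first\_search} sketch above, A{\star} is then best-first search with this evaluation function, and greedy best-first search is the special case \(f(n) = h(n)\):
\begin{verbatim}
# Sketch: A* and greedy best-first search as instances of best-first search,
# where h maps a node to an estimated cost-to-goal.
def astar_search(problem, h):
    return best_first_search(problem, f=lambda n: n.path_cost + h(n))

def greedy_best_first_search(problem, h):
    return best_first_search(problem, f=lambda n: h(n))
\end{verbatim}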