\chapter{Logistic regression}
\label{chap-logistic}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{A new hypothesis class: linear logistic classifiers}
For classification, it is natural to make predictions in $\{+1, -1\}$
and use the $0-1$ loss function. However, even for simple linear
classifiers, it is very difficult to
find values for $\theta, \theta_0$ that minimize simple training error
\[J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\text{sign}(\theta^T\ex{x}{i} + \theta_0),
\ex{y}{i})\;\;.\] This problem is NP-hard, which probably
\note{The ``probably'' here is not because we're too lazy to look it
up, but actually because of a fundamental unsolved problem in
computer-science theory, known as ``P vs. NP.''}
implies
that solving the most difficult instances of this problem would
require computation time {\em exponential} in the number of training
examples, $n$.
What makes this a difficult optimization problem is its lack of
``smoothness'':
\begin{itemize}
\item There can be two hypotheses, $(\theta, \theta_0)$ and
$(\theta', \theta_0')$, where
one is closer in parameter space to the optimal parameter values
$(\theta^*, \theta_0^*)$, but they make the same number of
misclassifications so they have the same $J$ value.
\item All predictions are categorical: the classifier can't express a
degree of certainty about whether a particular input $x$ should have
an associated value $y$.
\end{itemize}
For these reasons, if we are considering a hypothesis $\theta,\theta_0$
that makes five incorrect predictions, it is difficult to see how we
might change $\theta,\theta_0$ so that it will perform better, which
makes it difficult to design an algorithm that searches through the
space of hypotheses for a good one.
To address these problems, we will investigate a new hypothesis class:
{\em linear logistic classifiers}. These hypotheses are still
parameterized by a $d$-dimensional vector $\theta$ and a scalar
$\theta_0$, but instead of making predictions in $\{+1, -1\}$, they
generate real-valued outputs in the interval $(0, 1)$. A linear
logistic classifier has the form
\[h(x; \theta, \theta_0) = \sigma(\theta^T x + \theta_0)\;\;.\]
This looks familiar! What's new?
The {\em logistic} function, also known as the {\em sigmoid} function,
is defined as
\[\sigma(z) = \frac{1}{1+e^{-z}}\;\;,\] and plotted below, as a
function of its input $z$.
Its output can be interpreted as a probability, because for any value of
$z$ the output is in $(0, 1)$.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
% axis equal image,
scale only axis,
width=0.6\textwidth,
height=0.3\textwidth,
xmin=-4, xmax=4,
ymin=-0.25, ymax=1.25,
xlabel={$z$}, ylabel={$\sigma(z)$},
]
\addplot [domain=-4:4, samples=100, ultra thick] {1/(1+exp(-x))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {0};
\end{axis}
\end{tikzpicture}
\end{center}
\question{Convince yourself the output of $\sigma$ is always in the
interval $(0, 1)$. Why can't it equal 0 or equal 1? For what value
of $z$ does $\sigma(z) = 0.5$?}
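To make this concrete, here is a minimal numerical sketch in Python
(NumPy assumed available; the function name \texttt{sigmoid} is ours, for
illustration only):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    # Direct transcription of the definition.  Fine for moderate z;
    # np.exp(-z) can overflow for very negative z, which more careful
    # implementations guard against.
    return 1.0 / (1.0 + np.exp(-z))

for z in [-8.0, -2.0, 2.0, 8.0]:
    # Every output lies strictly between 0 and 1.
    print(z, sigmoid(z))
\end{verbatim}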
What does a linear logistic classifier (LLC) look like? Let's consider the
simple case where $d = 1$, so our input points simply lie along the
$x$ axis. The plot below shows LLCs for three different parameter
settings: $\sigma(10x + 1)$, $\sigma(-2x + 1)$, and $\sigma(2x - 3).$
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
% axis equal image,
scale only axis,
width=0.6\textwidth,
height=0.3\textwidth,
xmin=-4, xmax=4,
ymin=-0.25, ymax=1.25,
xlabel={$x$}, ylabel={$\sigma(\theta^T x + \theta_0)$},
]
\addplot [domain=-4:4, samples=100, ultra thick, red] {1/(1+exp(-(10*x + 1)))};
\addplot [domain=-4:4, samples=100, ultra thick, blue] {1/(1+exp(-(-2*x
+ 1)))};
\addplot [domain=-4:4, samples=100, ultra thick, green] {1/(1+exp(-(2*x
- 3)))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {0};
\end{axis}
\end{tikzpicture}
\end{center}
\question{Which plot is which? What governs the steepness of the
curve? What governs the $x$ value where the output is equal to
0.5?}
But wait! Remember that the definition of a classifier from
chapter~\ref{chap-classification} is that it's a mapping from $\R^d
\rightarrow \{-1, +1\}$ or to some other discrete set. So, then, it
seems like an LLC is actually not a classifier!
Given an LLC, with an output value in $(0, 1)$, what should we do if
we are forced to make a prediction in $\{+1, -1\}$? A default answer
is to predict $+1$ if $\sigma(\theta^T x + \theta_0) > 0.5$ and $-1$
otherwise. The value $0.5$ is sometimes called a {\em prediction
threshold}.
In fact, for different problem settings, we might prefer to pick a
different prediction threshold. The field of {\em decision theory}
considers how to make this choice from the perspective of Bayesian
reasoning. For example, if the consequences of predicting $+1$ when
the answer should be $-1$ are much worse than the consequences of
predicting $-1$ when the answer should be $+1$, then we might set the
prediction threshold to be greater than $0.5$.
\question{Using a prediction threshold of 0.5, for what values of $x$
do each of the LLCs shown in the figure above predict $+1$?}
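As a sketch of how this default thresholding rule might be implemented
(Python with NumPy assumed; the names \texttt{llc\_output},
\texttt{predict}, and \texttt{threshold} are ours, for illustration only):
\begin{verbatim}
import numpy as np

def llc_output(x, theta, theta_0):
    # Real-valued output of the linear logistic classifier, in (0, 1).
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta_0)))

def predict(x, theta, theta_0, threshold=0.5):
    # Convert the LLC output into a prediction in {+1, -1}.
    return +1 if llc_output(x, theta, theta_0) > threshold else -1

# Example with d = 1: the LLC sigma(2x - 3) from the plot above.
theta, theta_0 = np.array([2.0]), -3.0
print(predict(np.array([1.0]), theta, theta_0))   # sigma(-1) < 0.5, so -1
print(predict(np.array([2.0]), theta, theta_0))   # sigma(+1) > 0.5, so +1
\end{verbatim}
Raising or lowering \texttt{threshold} implements the kind of
decision-theoretic adjustment described above.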
When $d = 2$, our inputs $x$ lie in a two-dimensional space with
axes $x_1$ and $x_2$, and the output of the LLC is a surface, as shown
below, for $\theta = (1, 1), \theta_0 = 2$.
\includegraphics[width=0.7\textwidth]{figures/logreg3d}
\question{Convince yourself that the set of points for which
$\sigma(\theta^T x + \theta_0) = 0.5$, that is, the separator
between positive and negative predictions with prediction threshold
$0.5$ is a line in $(x_1, x_2)$ space. What particular line is it
for the case in the figure above?
How would the plot change for $\theta = (1, 1)$, but now
with $\theta_0 = -2$? For $\theta = (-1, -1), \theta_0 = 2$?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Loss function for logistic classifiers}
\label{logistic}
Recall that optimization is a key approach to solving machine learning
problems; this also applies to logistic regression, as we can see by
defining a loss function for this problem. Recall that a common loss function
is the 0-1 loss, introduced in Chapter~\ref{chap-intro}:
\[ L_{01}(h(x; \Theta), y) =
\begin{cases}
0 & \text{ if } y = h(x; \Theta)\\
1 & \text{ otherwise}
\end{cases}\;\;,
\]
which gives a value of 0 for a correct prediction, and a 1 for an
incorrect prediction. In the case of linear separators, this becomes:
\[ L_{01}(h(x;\theta, \theta_0), y) =
\begin{cases}
0 & \text{ if } y(\theta^Tx + \theta_0) > 0 \\
1 & \text{ otherwise}
\end{cases}\;\;.
\]
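In code, this version of the 0-1 loss is a one-line check; here is a
minimal sketch (Python with NumPy; the name \texttt{loss\_01} is ours):
\begin{verbatim}
import numpy as np

def loss_01(x, y, theta, theta_0):
    # y is in {+1, -1}; the loss is 0 exactly when the separator puts
    # x on the side corresponding to y, and 1 otherwise.
    return 0 if y * (theta @ x + theta_0) > 0 else 1
\end{verbatim}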
%
For logistic regression, we have defined a class of hypotheses, LLCs, whose outputs are in $(0, 1)$,
but we have training data with $y$ values in $\{+1, -1\}$. How
can we define a loss function? Intuitively, we would like to have
{\em low loss if we assign a low probability to the incorrect class.}
We'll define a loss function, called {\em negative log-likelihood} (NLL),
that does just this. In addition, it has the cool property that it
extends nicely to the case where we would like to classify our inputs
into more than two classes.
In order to simplify the description, we will assume that (or transform so that) the labels
in the training data are $y \in \{0, 1\}$, enabling them to be
interpreted as probabilities of being a member of the class of
interest. \note{\bf Remember to be sure your $y$
values have this form if you try to learn an LLC using NLL!}
We would like to pick the parameters of our classifier to maximize the
probability assigned by the LLC to the correct $y$ values, as
specified in the training set. Letting the guess be $\ex{g}{i} =
\sigma(\theta^T\ex{x}{i} + \theta_0)$,
that probability is
\begin{equation*}
\prod_{i = 1}^n \begin{cases} \ex{g}{i} & \text{if $\ex{y}{i} =
1$} \\ 1 - \ex{g}{i} & \text{otherwise}
\end{cases}\;\;,
\end{equation*}
under the assumption that our predictions are independent. This can
be cleverly rewritten, when $\ex{y}{i} \in \{0, 1\}$, as
\begin{equation*}
\prod_{i = 1}^n {\ex{g}{i}}^{\ex{y}{i}}(1 - \ex{g}{i})^{1 - \ex{y}{i}}\;\;.
\end{equation*}
\question{Be sure you can see why these two expressions are the
same.}
Now, because products are kind of hard to deal with, and because the
log function is monotonic, the $\theta, \theta_0$ that maximize the
log of this quantity will be the same as the $\theta, \theta_0$ that
maximize the original, so we can try to maximize
\begin{equation*}
\sum_{i = 1}^n \left( {\ex{y}{i}}\log {\ex{g}{i}} +
(1 - \ex{y}{i})\log(1 - \ex{g}{i})\right)\;\;.
\end{equation*}
We can turn the maximization problem above into a minimization problem by taking the negative
of the above expression, and writing it in terms of minimizing a loss
\begin{equation*}
\sum_{i = 1}^n \mathcal{L}_\text{nll}(\ex{g}{i}, \ex{y}{i})
\end{equation*}
where $\mathcal{L}_\text{nll}$ is the {\em negative log-likelihood}
loss function:
\begin{equation*}
\mathcal{L}_\text{nll}(\text{guess},\text{actual}) =
-\left(\text{actual}\cdot \log (\text{guess}) + (1 - \text{actual})\cdot\log (1 -
\text{guess})\right) \;\;.
\end{equation*}
This loss function is also sometimes referred to as the {\em log loss}
or {\em cross entropy}. \note{You can use any base for the logarithm
and it won't make any real difference. If we ask you for numbers,
use log base $e$.}
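As a concrete check on this definition, here is a minimal sketch of
$\mathcal{L}_\text{nll}$ in Python (using the natural log, per the note;
the function name \texttt{nll\_loss} is ours):
\begin{verbatim}
import numpy as np

def nll_loss(guess, actual):
    # actual is 0 or 1; guess is the LLC output, strictly inside (0, 1).
    return -(actual * np.log(guess) + (1 - actual) * np.log(1 - guess))

print(nll_loss(0.9, 1))   # ~0.105: high probability on the correct class
print(nll_loss(0.9, 0))   # ~2.303: high probability on the wrong class
\end{verbatim}
The loss is small when the guess puts high probability on the correct
class, and it grows without bound as the guess approaches full confidence
in the wrong class.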
\section{Logistic classification as optimization}
We can finally put all these pieces together and develop an objective
function for optimizing regularized negative log-likelihood for a
linear logistic classifier. \note{That's a lot of fancy words!} In
fact, this process is usually called ``logistic regression,'' so
we'll call our objective $J_\text{lr}$, and define it as
\[J_\text{lr}(\theta, \theta_0; {\cal D}) =
\left(\frac{1}{n} \sum_{i=1}^n
\mathcal{L}_\text{nll}(\sigma(\theta^T \ex{x}{i} + \theta_0), \ex{y}{i})\right) +
\lambda \norm{\theta}^2\;\;.\]
\question{Consider the case of linearly separable data. What will
the $\theta$ values that optimize this objective be like if
$\lambda = 0$? What will they be like if $\lambda$ is very big?
Try to work out an example in one dimension with two data points.}
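To see how the pieces fit together computationally, here is a minimal
sketch of evaluating $J_\text{lr}$ on a dataset (Python with NumPy; the
data layout and the names \texttt{X}, \texttt{y}, and \texttt{lam} are
our own conventions, and the labels are assumed to already be in
$\{0, 1\}$):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_lr(theta, theta_0, X, y, lam):
    # X: n x d array of inputs; y: length-n array of labels in {0, 1}.
    g = sigmoid(X @ theta + theta_0)                  # guesses, in (0, 1)
    nll = -(y * np.log(g) + (1 - y) * np.log(1 - g))  # per-example NLL
    return np.mean(nll) + lam * np.sum(theta ** 2)    # add the regularizer

# Tiny example: one-dimensional inputs, two data points.
X = np.array([[1.0], [-1.0]])
y = np.array([1, 0])
print(J_lr(np.array([2.0]), 0.0, X, y, lam=0.01))
\end{verbatim}
Note that, matching the definition above, the regularizer penalizes
$\theta$ but not $\theta_0$.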
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: