Objective function J(θ) = Σ_{i=1}^{N} J^(i)(θ)
SGD with sampling with replacement (draw i uniformly at random at each step):
procedure SGD(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        i ~ Uniform({1,2,...,N})
        θ ← θ - γ∇J^(i)(θ)
    return θ
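SGD with sampling without replacement (visit the examples in a freshly shuffled order on each pass):
procedure SGD(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1,2,...,N}) do
            θ ← θ - γ∇J^(i)(θ)
    return θ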
- It is common to implement SGD using sampling without replacement
- Epoch: a single pass through the training data
- For GD, only one update per epoch
- For SGD, N updates per epoch (N = # of train examples)
- Empirically, SGD often reduces MSE much more rapidly than GD when progress is measured per epoch
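The two sampling schemes and the epoch accounting above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `sgd`, `grad_i`, `with_replacement`, and the toy objective J^(i)(θ) = ½(θ - c_i)² are assumptions introduced here, and a fixed number of epochs stands in for "while not converged".

```python
import random


def sgd(grad_i, theta0, n_examples, step_size=0.05, epochs=200,
        with_replacement=False, seed=0):
    """Sketch of SGD; grad_i(theta, i) returns the gradient of J^(i) at theta."""
    rng = random.Random(seed)
    theta = list(theta0)
    updates = 0
    for _ in range(epochs):  # fixed budget stands in for "while not converged"
        if with_replacement:
            # N independent uniform draws (indices may repeat)
            order = [rng.randrange(n_examples) for _ in range(n_examples)]
        else:
            # one shuffled pass: every index appears exactly once
            order = list(range(n_examples))
            rng.shuffle(order)
        for i in order:
            g = grad_i(theta, i)
            theta = [t - step_size * gj for t, gj in zip(theta, g)]
            updates += 1
    return theta, updates


# Hypothetical toy objective: J^(i)(theta) = 0.5 * (theta - c_i)^2,
# whose sum is minimized at the mean of the c_i (here 3.0).
targets = [1.0, 2.0, 3.0, 6.0]


def grad_toy(theta, i):
    return [theta[0] - targets[i]]


theta, updates = sgd(grad_toy, [0.0], len(targets))
print(theta, updates)  # updates = epochs * N
```

With a constant step size the iterate hovers near the minimizer rather than settling on it exactly, which is why the check below uses a loose tolerance.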
SGD applied to Linear Regression is called the “Least Mean Squares” algorithm
procedure LMS(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1,2,...,N}) do
            g ← (θ^T x^(i) - y^(i)) x^(i)
            θ ← θ - γg
    return θ
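A minimal Python sketch of the LMS procedure follows. It assumes the per-example objective J^(i)(θ) = ½(θ^T x^(i) - y^(i))², which is what makes the gradient equal to the residual times x^(i). The function name `lms`, the fixed epoch budget, and the noise-free toy data are illustrative assumptions.

```python
import random


def lms(xs, ys, theta0, step_size=0.05, epochs=200, seed=0):
    """Sketch of LMS: SGD on squared error for a linear model."""
    rng = random.Random(seed)
    theta = list(theta0)
    n = len(xs)
    for _ in range(epochs):  # fixed budget stands in for "while not converged"
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            # residual = theta^T x^(i) - y^(i)
            residual = sum(t * xj for t, xj in zip(theta, xs[i])) - ys[i]
            # g = residual * x^(i); theta <- theta - step_size * g
            theta = [t - step_size * residual * xj
                     for t, xj in zip(theta, xs[i])]
    return theta


# Hypothetical noise-free data from y = 1 + 2*x; the leading 1.0 is a bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = lms(xs, ys, [0.0, 0.0])
print(theta)  # close to [1.0, 2.0]
```

Because the toy data have no noise, every per-example gradient vanishes at the true parameters, so the iterates can converge even with a constant step size; with noisy data they would only hover near the least-squares solution.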
Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function
procedure GDLR(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        g ← Σ_{i=1}^{N} (θ^T x^(i) - y^(i)) x^(i)
        θ ← θ - γg
    return θ
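For comparison with the per-example updates of SGD, here is a hedged Python sketch of the batch procedure. It follows the pseudocode in summing the gradient over all N examples without a 1/N factor. The name `gd_linear_regression`, the iteration budget, and the toy data are assumptions made for illustration.

```python
def gd_linear_regression(xs, ys, theta0, step_size=0.05, iterations=500):
    """Sketch of batch gradient descent for linear regression."""
    theta = list(theta0)
    for _ in range(iterations):  # fixed budget stands in for "while not converged"
        g = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            residual = sum(t * xj for t, xj in zip(theta, x)) - y
            for j, xj in enumerate(x):
                g[j] += residual * xj  # accumulate the full-data gradient
        # exactly one parameter update per pass over the data
        theta = [t - step_size * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical noise-free data from y = 1 + 2*x; the leading 1.0 is a bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gd_linear_regression(xs, ys, [0.0, 0.0])
print(theta)  # close to [1.0, 2.0]
```

Since the gradient is an unnormalized sum, the step size that keeps the iteration stable shrinks as N grows; dividing the sum by N is a common alternative that this sketch does not use.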
- Assumes the output is generated by a deterministic target function
- x^(i) ~ p*(·)
- y^(i) = c*(x^(i))
- Assumes the output is generated from a conditional probability distribution
- x^(i) ~ p*(·)
- y^(i) ~ p*(·|x^(i))
- An oracle knows everything (e.g. the usually unknown p*(y|x))
- Optimal classifier for 0/1 loss function
- y ∈ {0,1}
- ŷ = h(x) = 1 if p(y=1|x) ≥ p(y=0|x), and 0 otherwise
- = argmax_{y ∈ {0,1}} p(y|x)
- Reducible error
- Irreducible error
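The decision rule above can be illustrated with a small Python sketch. It assumes an oracle that hands us p*(x) and p*(y=1|x) over three discrete inputs; those tables and the names `bayes_classifier`, `p_x`, and `p_one` are made up for illustration. The last lines compute the expected 0/1 loss of this classifier, which is the part of the error that no classifier can remove.

```python
def bayes_classifier(p_one_given_x):
    """Return h(x) = argmax over y in {0, 1} of p(y|x), given x -> p(y=1|x)."""
    def h(x):
        prob_one = p_one_given_x(x)
        # ties go to 1, matching the ">=" in the rule
        return 1 if prob_one >= 1.0 - prob_one else 0
    return h


# Hypothetical oracle knowledge over three discrete inputs.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}    # p*(x)
p_one = {"a": 0.9, "b": 0.5, "c": 0.2}  # p*(y=1|x)

h = bayes_classifier(p_one.get)
predictions = {x: h(x) for x in p_x}

# Expected 0/1 loss: for each x, the probability of the label h(x) does not pick.
bayes_error = sum(
    p_x[x] * ((1.0 - p_one[x]) if h(x) == 1 else p_one[x]) for x in p_x
)
print(predictions, bayes_error)
```

Here the expected loss works out to 0.5·0.1 + 0.3·0.5 + 0.2·0.2 = 0.24.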
Choose parameters that make the data most likely
Assumes: the data D = {x^(i)} are generated independently and identically distributed (iid) from a distribution p*(x|θ*), so the likelihood factors as p(D|θ) = Π_{i=1}^{N} p(x^(i)|θ), and p* comes from a family of distributions parametrized by θ ∈ H (the set of possible parameters)
- log is monotonic
- θ_MLE
- = argmax_θ p(D|θ)
- = argmax_θ log p(D|θ) (usually a constrained optimization)
- = argmax_θ l(θ), where l(θ) = log p(D|θ) is treated as a function of θ with D held constant
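Written out as a worked step, and assuming the iid assumption lets the likelihood factor over the N examples, the chain of equalities looks as follows. Because log is strictly increasing, applying it does not move the maximizer, and it turns the product into a sum that is easier to differentiate.

```latex
\begin{align*}
\theta_{\mathrm{MLE}}
  &= \operatorname*{argmax}_{\theta} \; p(\mathcal{D} \mid \theta) \\
  &= \operatorname*{argmax}_{\theta} \; \prod_{i=1}^{N} p(x^{(i)} \mid \theta) \\
  &= \operatorname*{argmax}_{\theta} \; \log \prod_{i=1}^{N} p(x^{(i)} \mid \theta) \\
  &= \operatorname*{argmax}_{\theta} \; \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)
   = \operatorname*{argmax}_{\theta} \; \ell(\theta)
\end{align*}
```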
Bad Idea #1: Bernoulli Classifier (Majority Vote)
y   x1    x2
1   0.5    9
0   3      4
1   2      1
1   1     -3
- Ignore x
Model: y ~ Bernoulli(φ)
p(y|x) =
- φ if y = 1
- (1 - φ) if y = 0
Conditional log-likelihood
- l(φ)
- = log p(D|φ)
- = Σ_{i=1}^{N} log p(y^(i)|x^(i))
- = log φ + log(1 - φ) + log φ + log φ
- = 3 log φ + log(1 - φ)
φ_MLE = argmax_{φ ∈ [0,1]} l(φ) = 3 / 4
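The value 3/4 can be checked by setting the derivative of l(φ) = 3 log φ + log(1 - φ) to zero: 3/φ - 1/(1 - φ) = 0 gives φ = 3/4. The Python sketch below confirms this numerically for the four labels in the table (y = 1, 0, 1, 1). The function name `log_likelihood` and the grid resolution are choices made for illustration.

```python
import math


def log_likelihood(phi, ys):
    """Bernoulli log-likelihood: sum of log(phi) for y = 1 and log(1 - phi) for y = 0."""
    return sum(math.log(phi) if y == 1 else math.log(1.0 - phi) for y in ys)


ys = [1, 0, 1, 1]  # labels from the table; x is ignored by this model

# Closed form: the fraction of examples with y = 1.
phi_closed_form = sum(ys) / len(ys)

# Numerical check: maximize over a grid strictly inside (0, 1),
# since log(0) is undefined at the endpoints.
grid = [k / 1000 for k in range(1, 1000)]
phi_grid = max(grid, key=lambda phi: log_likelihood(phi, ys))

print(phi_closed_form, phi_grid)
```

Both approaches agree because l(φ) is strictly concave on (0, 1), so the stationary point is the unique maximizer.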
Bayes Classifier
- ŷ
- = h_{φ_MLE}(x)
- = argmax_{y ∈ {0,1}} p(y|x, φ_MLE) (constrained optimization)
- = 1, since φ_MLE = 3/4 ≥ 1/2
- Majority Vote
- y^