Objective function J(θ) = Σ_{i=1}^{N} J^(i)(θ), a sum of per-example objectives
procedure SGD(D, θ^(0))        (sampling with replacement)
    θ ← θ^(0)
    while not converged do
        i ~ Uniform({1, 2, ..., N})
        θ ← θ - γ∇J^(i)(θ)
    return θ

procedure SGD(D, θ^(0))        (sampling without replacement: one shuffled pass per epoch)
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1, 2, ..., N}) do
            θ ← θ - γ∇J^(i)(θ)
    return θ
- It is common to implement SGD using sampling without replacement
- epoch: a single pass through the training data
- For GD, only one update per epoch
- For SGD, N updates per epoch (N = number of training examples)
- Empirically, SGD typically reduces MSE much more rapidly than GD for the same number of passes over the data
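A minimal NumPy sketch of the shuffled-epoch (without-replacement) variant above; the function name, the per-example gradient callback grad_J_i, and the default step size are illustrative choices, not part of the notes:

import numpy as np

def sgd(X, y, theta0, grad_J_i, gamma=0.01, num_epochs=100):
    # SGD with sampling without replacement: one shuffled pass per epoch
    theta = np.array(theta0, dtype=float)
    N = len(y)
    for _ in range(num_epochs):
        for i in np.random.permutation(N):
            theta = theta - gamma * grad_J_i(theta, X[i], y[i])
    return theta

The with-replacement variant would instead draw i = np.random.randint(N) on every update inside a single loop.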
SGD applied to Linear Regression is called the “Least Mean Squares” (LMS) algorithm
procedure LMS(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1, 2, ..., N}) do
            g ← (θ^T x^(i) - y^(i)) x^(i)
            θ ← θ - γg
    return θ
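A matching NumPy sketch of LMS, assuming the per-example squared-error objective J^(i)(θ) = ½(θ^T x^(i) - y^(i))², whose gradient is the g above (names and defaults are illustrative):

import numpy as np

def lms(X, y, theta0, gamma=0.01, num_epochs=100):
    # Least Mean Squares: SGD on linear regression, shuffled passes
    theta = np.array(theta0, dtype=float)
    N = len(y)
    for _ in range(num_epochs):
        for i in np.random.permutation(N):
            g = (theta @ X[i] - y[i]) * X[i]   # per-example gradient
            theta = theta - gamma * g
    return theta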
Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function
procedure GDLR(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        g ← Σ_{i=1}^{N} (θ^T x^(i) - y^(i)) x^(i)
        θ ← θ - γg
    return θ
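For comparison, a batch gradient descent sketch for linear regression, where each update uses the full summed gradient (names and defaults are illustrative):

import numpy as np

def gd_linreg(X, y, theta0, gamma=0.01, num_iters=1000):
    # Batch GD for linear regression: one update per pass over the data
    theta = np.array(theta0, dtype=float)
    for _ in range(num_iters):
        g = X.T @ (X @ theta - y)   # sum of per-example gradients
        theta = theta - gamma * g
    return theta

This performs one update per epoch, versus N updates per epoch for LMS/SGD above.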
- Assumes the output is generated by a deterministic target function c*:
    - x^(i) ~ p*(·)
    - y^(i) = c*(x^(i))
- Assumes the output is generated from a conditional probability distribution:
    - x^(i) ~ p*(·)
    - y^(i) ~ p*(·|x^(i))
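A tiny sketch contrasting the two data-generation assumptions; the particular p* and c* below are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)

# x^(i) ~ p*(.)
x = rng.normal(size=5)

# Deterministic target function: y^(i) = c*(x^(i))
y_det = (x > 0).astype(int)

# Conditional distribution: y^(i) ~ p*(.|x^(i)), i.e. noisy labels
p_y1 = 1 / (1 + np.exp(-4 * x))                      # made-up p*(y=1|x)
y_rand = (rng.uniform(size=5) < p_y1).astype(int)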
- An oracle knows everything, e.g. the (usually unknown) true distribution p*(y|x)
- Optimal classifier for the 0/1 loss function (y ∈ {0, 1}):
    ŷ = h(x) = 1 if p(y=1|x) ≥ p(y=0|x), 0 otherwise
             = argmax_{y ∈ {0,1}} p(y|x)
- Reducible error: the part of a classifier's error above the Bayes error, which a better classifier could remove
- Irreducible error: the Bayes error itself, due to inherent noise in p*(y|x)
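A minimal sketch of the optimal decision rule, assuming access to the (normally unknown) conditional p*(y=1|x); the callback name p_star_y1 is hypothetical:

def bayes_optimal_predict(x, p_star_y1):
    # Optimal classifier for 0/1 loss: predict the more probable label
    p1 = p_star_y1(x)                # p*(y=1|x), known only to an oracle
    return 1 if p1 >= 1 - p1 else 0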
- Maximum likelihood estimation (MLE): choose the parameters that make the data most likely
- Assumes: the data are generated iid (independent and identically distributed) from a distribution p*(x) = p(x|θ*), so that p(D|θ*) = Π_{i=1}^{N} p(x^(i)|θ*), and that this distribution comes from a family of distributions parametrized by θ ∈ H (the set of possible parameters)
- Log-likelihood: since log is monotonic, maximizing log p(D|θ) is equivalent to maximizing p(D|θ)
- θMLE = argmax_θ p(D|θ)
       = argmax_θ log p(D|θ)   (often a constrained optimization, e.g. over φ ∈ [0,1] below)
       = argmax_θ l(θ), where l(θ) = log p(D|θ) is treated as a function of θ with D held constant
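A small numerical check of the "log is monotonic" point: maximizing the likelihood and the log-likelihood over a grid selects the same parameter. The grid and the toy labels (which preview the example below) are illustrative:

import numpy as np

y = np.array([1, 0, 1, 1])                  # toy labels, as in the data table below
phis = np.linspace(0.01, 0.99, 99)          # candidate parameter values

likelihood = np.array([np.prod(np.where(y == 1, p, 1 - p)) for p in phis])
log_likelihood = np.array([np.sum(np.log(np.where(y == 1, p, 1 - p))) for p in phis])

# log is monotonic, so both criteria pick the same parameter (here 0.75)
assert phis[likelihood.argmax()] == phis[log_likelihood.argmax()]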
Bad Idea #1: Bernoulli Classifier (Majority Vote)
- Data (N = 4 training examples):

    y    x1    x2
    1    0.5    9
    0    3      4
    1    2      1
    1    1     -3
- Assumption: ignore x (the features are not used at all)
- Model: y ~ Bernoulli(φ)
- p(y|x) = φ if y = 1, and 1 - φ if y = 0
- Conditional log-likelihood:
    l(φ) = log p(D|φ)
         = Σ_{i=1}^{N} log p(y^(i)|x^(i))
         = log φ + log(1-φ) + log φ + log φ
         = 3 log φ + log(1-φ)
    [Plot of l(φ) as a function of φ]
- φMLE = argmax_{φ ∈ [0,1]} l(φ) = 3/4, since setting dl/dφ = 3/φ - 1/(1-φ) = 0 gives φ = 3/4
- Bayes Classifier:
    ŷ = h_φMLE(x)
      = argmax_{y ∈ {0,1}} p(y|x, φMLE)   (an optimization constrained to y ∈ {0,1})
      = 1, since φMLE = 3/4 ≥ 1/2
- Majority Vote:
    ŷ = 1 (the most common label in the training data), so the two rules agree here
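A compact sketch of this classifier on the toy data above: fit φ by its closed-form MLE (the fraction of positive labels) and predict the more probable label for every x. Names are illustrative:

import numpy as np

y_train = np.array([1, 0, 1, 1])          # labels from the data table; x is ignored

phi_mle = y_train.mean()                   # closed-form MLE: fraction of 1s = 3/4

def predict(x=None):
    # Bernoulli / majority-vote classifier: same prediction for every x
    return 1 if phi_mle >= 0.5 else 0

print(phi_mle, predict())                  # 0.75 1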