Lecture 09: Stochastic Gradient Descent + Probabilistic Learning

Stochastic Gradient Descent

Objective function J(θ) = Σ_{i=1}^N J^(i)(θ)

procedure SGD(D, θ^(0))                        ▹ variant 1: sample i with replacement
    θ ← θ^(0)
    while not converged do
        i ~ Uniform({1,2,...,N})
        θ ← θ - γ∇J^(i)(θ)
    return θ

procedure SGD(D, θ^(0))                        ▹ variant 2: sample without replacement (shuffle each epoch)
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1,2,...,N}) do
            θ ← θ - γ∇J^(i)(θ)
    return θ
  • It is common to implement SGD using sampling without replacement (the shuffled variant above)
  • epoch: a single pass through the training data
  • For GD, there is only one update per epoch
  • For SGD, there are N updates per epoch (N = # of training examples)
  • Empirically, SGD reduces MSE much more rapidly than GD
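
A minimal NumPy sketch of the shuffled (without-replacement) variant above; the function name, `grad_Ji` callable, learning rate, and epoch count are illustrative placeholders, not part of the lecture.

```python
import numpy as np

def sgd(grad_Ji, theta0, N, lr=0.1, epochs=10, seed=0):
    """Shuffled-epoch SGD: one parameter update per training example.

    grad_Ji(theta, i) is assumed to return the gradient of J^(i) at theta.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(epochs):                  # one epoch = one pass through the data
        for i in rng.permutation(N):         # sampling without replacement (shuffle)
            theta -= lr * grad_Ji(theta, i)  # step opposite the per-example gradient
    return theta
```

For linear regression, passing `grad_Ji = lambda theta, i: (theta @ X[i] - y[i]) * X[i]` (with an assumed design matrix `X` and targets `y`) recovers the LMS algorithm in the next section.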

SGD for Linear Regression

SGD applied to Linear Regression is called the “Least Mean Squares” (LMS) algorithm

procedure LMS(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        for i ∈ shuffle({1,2,...,N}) do
            g ← (θ^Tx^(i) - y^(i))x^(i)
            θ ← θ - γg
    return θ
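
A minimal NumPy sketch of the LMS procedure above; the N×d design matrix `X`, target vector `y`, learning rate, and epoch count are assumed for illustration.

```python
import numpy as np

def lms(X, y, lr=0.01, epochs=10, seed=0):
    """Least Mean Squares: SGD on the squared-error objective for linear regression."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(N):              # shuffled pass through the training data
            g = (theta @ X[i] - y[i]) * X[i]      # per-example gradient (θ^T x^(i) - y^(i)) x^(i)
            theta -= lr * g
    return theta
```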

GD for Linear Regression

Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function

procedure GDLR(D, θ^(0))
    θ ← θ^(0)
    while not converged do
        g ← Σ_{i=1}^N (θ^Tx^(i) - y^(i))x^(i)
        θ ← θ - γg
    return θ
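
The same objective with batch gradient descent, sketched under the same assumed `X` and `y`; note the single update per pass over the data, versus N updates per pass for LMS.

```python
import numpy as np

def gd_linreg(X, y, lr=0.01, iters=100):
    """Batch gradient descent for linear regression: one update per full pass."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ theta - y)   # Σ_i (θ^T x^(i) - y^(i)) x^(i) as one matrix product
        theta -= lr * g
    return theta
```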

Probabilistic Learning

Functional Approximation

  • assumes the output is generated by a deterministic target function c*
  • x^(i) ~ p*(·)
  • y^(i) = c*(x^(i))

Probabilistic Learning

  • assumes the output is generated from a conditional probability distribution p*(y|x)
  • x^(i) ~ p*(·)
  • y^(i) ~ p*(·|x^(i))
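
A small sketch contrasting the two data-generating assumptions on made-up 1-D inputs; the threshold target and the logistic-shaped conditional are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5)        # x^(i) ~ p*(.)

# Functional approximation: labels come from a deterministic target c*(x).
y_det = (x > 0).astype(int)               # y^(i) = c*(x^(i)), identical on every re-run

# Probabilistic learning: labels are drawn from p*(y | x), so repeated draws can differ.
p_y1 = 1.0 / (1.0 + np.exp(-5.0 * x))     # illustrative conditional p*(y=1 | x)
y_rand = rng.binomial(1, p_y1)            # y^(i) ~ p*(. | x^(i))
```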

Bayes Classifier

  • The Bayes classifier is an oracle: it has access to the true conditional distribution p*(y|x), which is usually unknown
  • Optimal classifier for the 0/1 loss function
    • y ∈ {0,1}
    • ŷ = h(x) = argmax_{y ∈ {0,1}} p(y|x)
      • = 1 if p(y=1|x) >= p(y=0|x)
      • = 0 otherwise
  • Reducible error: the gap between a learned classifier's error and the Bayes error; it can be shrunk with a better hypothesis class or more data
  • Irreducible error: the error of the Bayes classifier itself, due to inherent randomness in p*(y|x); no classifier can do better
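
A tiny sketch of the decision rule when the oracle conditional is available; `p_y1_given_x` is a hypothetical callable standing in for the usually unknown true p*(y=1|x).

```python
def bayes_classifier(p_y1_given_x, x):
    """Bayes-optimal prediction under 0/1 loss: pick the more probable label."""
    p1 = p_y1_given_x(x)                # oracle access to p*(y=1 | x)
    return 1 if p1 >= 1.0 - p1 else 0   # i.e. argmax over y in {0,1}
```

For example, `bayes_classifier(lambda x: 0.7, x=None)` returns 1; any error this classifier still makes is the irreducible error.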

Maximum Likelihood Estimation

  • Choose parameters that make the data most likely

  • Assumes: the data are generated independent and identically distributed (iid) from a distribution p*(x|θ*), so p(D|θ) = Π_i p(x^(i)|θ), and that p* belongs to a family of distributions parametrized by θ ∈ H (the set of possible parameters)

  • log-likelihood

    • log is monotonic, so maximizing log p(D|θ) gives the same maximizer as maximizing p(D|θ)
    • θMLE
      • = argmax_θ p(D|θ)
      • = argmax_θ log p(D|θ) (often a constrained optimization, e.g. θ restricted to valid parameter values)
      • = argmax_θ l(θ), where l(θ) = log p(D|θ) is treated as a function of θ with D held fixed
  • Bad Idea #1: Bernoulli Classifier (Majority Vote)

    • Data

      • y   x1    x2
        1   0.5    9
        0   3      4
        1   2      1
        1   1     -3
    • Assumption:

      • Ignore x
    • Model: y ~ Bernoulli(φ)

    • p(y|x) =

      • φ if y = 1
      • 1 - φ if y = 0
    • Conditional log-likelihood

      • l(φ)
        • = log p(D|φ)
        • = Σ_i log p(y^(i)|x^(i), φ)
        • = logφ + log(1-φ) + logφ + logφ
        • = 3logφ + log(1-φ)
    • φMLE = argmax_{φ ∈ [0,1]} l(φ) = 3/4  (set dl/dφ = 3/φ - 1/(1-φ) = 0 ⟹ φ = 3/4)

    • Bayes Classifier

      • ŷ
        • = h_φMLE(x)
        • = argmax_{y ∈ {0,1}} p(y|x, φMLE)
        • = 1  (since φMLE = 3/4 ≥ 1/2)
      • Majority vote: the classifier ignores x and always predicts the most common label in the training data
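
A minimal sketch of Bad Idea #1 on the toy data above: fit φ by MLE (the fraction of y = 1 labels) and predict the resulting majority label for every input.

```python
import numpy as np

y = np.array([1, 0, 1, 1])       # labels from the toy dataset; the features x are ignored

phi_mle = y.mean()               # closed-form Bernoulli MLE: φ = (# of y=1)/N = 3/4

def predict(x):
    """ŷ = argmax_y p(y | x, φ_MLE): independent of x, i.e. a majority vote."""
    return 1 if phi_mle >= 0.5 else 0
```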