| Function Approximation | Probabilistic Learning |
|---|---|
| Previously, we assumed that our output was generated using a deterministic target function: $y^{(i)} = c^*(x^{(i)})$ | Today, we assume that our output is sampled from a conditional probability distribution: $y^{(i)} \sim p^*(\cdot \mid x^{(i)})$ |
| Our goal was to learn a hypothesis $h(x)$ that best approximates the target function $c^*(x)$ | Our goal is to learn a probability distribution $p(y \mid x)$ that best approximates $p^*(y \mid x)$ |
- Discrete Random Variable $X$: Probability Mass Function (pmf) $p(x)$
- Continuous Random Variable $X$: Probability Density Function (pdf) $f(x)$, with $P(a \leq X \leq b) = \int_a^b f(x)\,dx$
- Cumulative Distribution Function: $F(x) = P(X \leq x) = \sum_{x' \leq x} p(x')$ (discrete) or $\int_{-\infty}^{x} f(x')\,dx'$ (continuous)
- Conditional Probability: $P(A|B) = \frac{P(A,B)}{P(B)}$
- Beta Distribution
  - pdf: $f(\Phi|\alpha,\beta) = \frac{1}{B(\alpha,\beta)} \Phi^{\alpha-1} (1-\Phi)^{\beta-1}$
- Dirichlet Distribution
  - pdf: $p(\Phi|\alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^K \Phi_k^{\alpha_k-1}$ where $B(\alpha) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^K \alpha_k)}$
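As a quick numerical sanity check on these two pdfs (not part of the notes), the sketch below evaluates them by hand and compares against `scipy.stats`; the particular $\alpha, \beta$ values are arbitrary.

```python
import numpy as np
from scipy.stats import beta, dirichlet
from scipy.special import beta as B, gamma as Gamma

# Beta(alpha, beta) pdf: hand formula vs. scipy (illustrative values)
a, b = 2.0, 5.0
phi = 0.3
pdf_by_hand = phi**(a - 1) * (1 - phi)**(b - 1) / B(a, b)
print(pdf_by_hand, beta.pdf(phi, a, b))             # the two should match

# Dirichlet(alpha) pdf, with B(alpha) = prod_k Gamma(alpha_k) / Gamma(sum_k alpha_k)
alpha = np.array([2.0, 3.0, 4.0])
phi_vec = np.array([0.2, 0.3, 0.5])                 # must lie on the simplex
B_alpha = np.prod(Gamma(alpha)) / Gamma(alpha.sum())
pdf_by_hand = np.prod(phi_vec**(alpha - 1)) / B_alpha
print(pdf_by_hand, dirichlet.pdf(phi_vec, alpha))   # the two should match
```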
- Expectation
  - Discrete: $E[X] = \sum_x x\,p(x)$; Continuous: $E[X] = \int x\,f(x)\,dx$
- Variance
  - $Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2$ where $\mu = E[X]$
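A minimal check of these two definitions on a small hand-made pmf (the values are purely illustrative), confirming that the two variance forms agree:

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])          # pmf, sums to 1

E_X  = np.sum(x * p)                         # E[X] = sum_x x p(x)
E_X2 = np.sum(x**2 * p)
var  = E_X2 - E_X**2                         # Var(X) = E[X^2] - E[X]^2
print(E_X, var, np.sum((x - E_X)**2 * p))    # both variance forms give the same value
```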
- Suppose we have $N$ samples $D = \{x^{(1)}, x^{(2)}, \cdots, x^{(N)}\}$ from a random variable $X$
- The likelihood function:
  - If $X$ is discrete with pmf $p(x|\theta)$: $L(\theta) = p(x^{(1)}|\theta)\, p(x^{(2)}|\theta) \cdots p(x^{(N)}|\theta)$
  - If $X$ is continuous with pdf $f(x|\theta)$: $L(\theta) = f(x^{(1)}|\theta)\, f(x^{(2)}|\theta) \cdots f(x^{(N)}|\theta)$
- The log-likelihood function:
  - If $X$ is discrete with pmf $p(x|\theta)$: $\ell(\theta) = \log p(x^{(1)}|\theta) + \log p(x^{(2)}|\theta) + \cdots + \log p(x^{(N)}|\theta)$
  - If $X$ is continuous with pdf $f(x|\theta)$: $\ell(\theta) = \log f(x^{(1)}|\theta) + \log f(x^{(2)}|\theta) + \cdots + \log f(x^{(N)}|\theta)$
- In both cases, the likelihood tells us how likely one sample is relative to another (see the short numerical illustration below)
- The joint likelihood function:
  - If $X$ and $Y$ are discrete with pmf $p(x,y|\theta)$: $L(\theta) = p(x^{(1)},y^{(1)}|\theta)\, p(x^{(2)},y^{(2)}|\theta) \cdots p(x^{(N)},y^{(N)}|\theta)$
  - If $X$ and $Y$ are continuous with pdf $f(x,y|\theta)$: $L(\theta) = f(x^{(1)},y^{(1)}|\theta)\, f(x^{(2)},y^{(2)}|\theta) \cdots f(x^{(N)},y^{(N)}|\theta)$
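A short illustration of the likelihood and log-likelihood for i.i.d. Bernoulli samples; the data and the candidate parameter values below are made up for the example.

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1])                   # illustrative Bernoulli samples
phi = 0.6                                          # candidate parameter

pmf = phi**D * (1 - phi)**(1 - D)                  # p(x^(i) | phi) for each sample
L   = np.prod(pmf)                                 # likelihood: product over samples
ll  = np.sum(np.log(pmf))                          # log-likelihood: sum of logs
print(L, np.exp(ll))                               # identical up to rounding

# Comparing parameter values: the log-likelihood at phi = 0.9 is lower for this data
print(ll, np.sum(np.log(0.9**D * 0.1**(1 - D))))
```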
- Suppose we have data $D = \{x^{(i)}\}_{i=1}^N$
- Maximum Likelihood Estimation (MLE): choose the parameters that maximize the likelihood of the data
  - $\theta^{MLE} = \operatorname{argmax}_{\theta} \prod_{i=1}^N p(x^{(i)}|\theta)$
- Recipe (a numerical version appears after these steps):
  - Assume the data was generated i.i.d. from some model (i.e. write the generative story): $x^{(i)} \sim p(x|\theta)$
  - Write the log-likelihood: $\ell(\theta) = \log p(x^{(1)}|\theta) + \cdots + \log p(x^{(N)}|\theta)$
  - Compute partial derivatives (i.e. the gradient): $\partial\ell(\theta)/\partial\theta_1, \ldots, \partial\ell(\theta)/\partial\theta_M$
  - Set the derivatives to zero and solve for $\theta$: $\partial\ell(\theta)/\partial\theta_m = 0$ for all $m \in \{1, \cdots, M\}$; $\theta^{MLE}$ is the solution to this system of $M$ equations in $M$ variables
  - Compute the second derivative and check that $\ell(\theta)$ is concave down at $\theta^{MLE}$
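When the "set derivatives to zero and solve" step has no convenient closed form, the same recipe can be carried out numerically. The sketch below swaps the analytic solve for `scipy.optimize.minimize_scalar`; the Gaussian-mean model with known unit variance is an assumption made only for this illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
D = rng.normal(loc=2.5, scale=1.0, size=500)       # step 1: i.i.d. generative story

def neg_log_likelihood(mu):
    # step 2: log-likelihood of a unit-variance Gaussian, negated so we can minimize
    return -np.sum(-0.5 * (D - mu)**2 - 0.5 * np.log(2 * np.pi))

# steps 3-4 replaced by a numerical optimizer
result = minimize_scalar(neg_log_likelihood)
print(result.x, D.mean())                          # numerical MLE ~ sample mean
```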
- Goal
  - pdf of Exponential(λ): $f(x) = \lambda e^{-\lambda x}$
  - Suppose $X_i \sim \text{Exponential}(\lambda)$ for $1 \leq i \leq N$
  - Find the MLE for data $D=\{x^{(i)}\}_{i=1}^N$
- Steps
  - First write down the log-likelihood of the sample: $\ell(\lambda) = N \log\lambda - \lambda \sum_{i=1}^N x^{(i)}$
  - Compute the first derivative, set it to zero, and solve for λ: $\frac{d\ell(\lambda)}{d\lambda} = \frac{N}{\lambda} - \sum_{i=1}^N x^{(i)} = 0 \;\Rightarrow\; \lambda^{MLE} = \frac{N}{\sum_{i=1}^N x^{(i)}}$
  - Compute the second derivative and check that $\ell(\lambda)$ is concave down at $\lambda^{MLE}$
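A simulation check of $\lambda^{MLE} = N / \sum_i x^{(i)}$ on synthetic data; the true rate of 2.0 is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
D = rng.exponential(scale=1.0 / lam_true, size=10_000)   # numpy uses scale = 1/lambda

lam_mle = len(D) / D.sum()          # lambda^MLE = N / sum_i x^(i)
print(lam_mle)                      # close to 2.0 for large N
```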
- Model: $x^{(i)} \sim \text{Bernoulli}(\Phi)$ with data $D = \{x^{(1)}, \cdots, x^{(N)}\}$
- Log-likelihood
$$\ell(\Phi) = \log p(D|\Phi) = \log \prod_{i=1}^N p(x^{(i)}|\Phi) = \log \prod_{i=1}^N \Phi^{x^{(i)}} (1-\Phi)^{1-x^{(i)}} = N_1 \log\Phi + N_0 \log(1-\Phi)$$
  where $N_1$ = number of samples with $x^{(i)}=1$ and $N_0$ = number with $x^{(i)}=0$
- Derivative: $\frac{d\ell(\Phi)}{d\Phi} = \frac{N_1}{\Phi} - \frac{N_0}{1-\Phi}$
- Set to zero and solve: $\frac{N_1}{\Phi} - \frac{N_0}{1-\Phi} = 0 \;\Rightarrow\; \Phi^{MLE} = \frac{N_1}{N_1+N_0} = \frac{N_1}{N}$
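A one-line check that the Bernoulli MLE is just the fraction of ones, on made-up coin flips:

```python
import numpy as np

D = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # illustrative flips (7 ones, 3 zeros)
N1, N0 = D.sum(), (1 - D).sum()
print(N1 / (N1 + N0), D.mean())                # phi^MLE = N1 / N = 0.7, both ways
```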
- Maximum a Posteriori (MAP) estimation: choose the parameters that maximize the posterior of the parameters given the data
  - $\theta^{MAP} = \operatorname{argmax}_{\theta} p(\theta|D) = \operatorname{argmax}_{\theta}\, p(\theta) \prod_{i=1}^N p(x^{(i)}|\theta)$
  - where $p(\theta)$ is the prior (usually a pdf)
- MLE maximizes the likelihood $p(D|\theta)$; MAP maximizes the posterior $p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)}$ (Bayes' rule: posterior = likelihood × prior / evidence, where the evidence $p(D)$ is not a function of $\theta$)
$$\theta^{MAP} = \operatorname{argmax}_{\theta} p(\theta|D) = \operatorname{argmax}_{\theta} \log p(\theta|D) = \operatorname{argmax}_{\theta} \log\frac{p(D|\theta)\,p(\theta)}{p(D)} = \operatorname{argmax}_{\theta} \log\big(p(D|\theta)\,p(\theta)\big) = \operatorname{argmax}_{\theta}\, \ell^{MAP}(\theta)$$
  where $\ell^{MAP}(\theta) := \log\big(p(D|\theta)\,p(\theta)\big)$
- Model:
  - $\Phi \sim \text{Beta}(\alpha,\beta)$
  - $x^{(i)} \sim \text{Bernoulli}(\Phi)$ for $i = 1, \ldots, N$
- MAP objective ($\Phi$ is the parameter; $\alpha, \beta$ are hyperparameters)
$$\ell^{MAP}(\Phi) = \log\big[p(D|\Phi)\,f(\Phi|\alpha,\beta)\big] = \log\Big[\Phi^{N_1}(1-\Phi)^{N_0} \cdot \tfrac{1}{B(\alpha,\beta)}\,\Phi^{\alpha-1}(1-\Phi)^{\beta-1}\Big] = (N_1+\alpha-1)\log\Phi + (N_0+\beta-1)\log(1-\Phi) - \log B(\alpha,\beta) = N_1'\log\Phi + N_0'\log(1-\Phi) - \log B(\alpha,\beta)$$
  where $N_1' = N_1 + \alpha - 1$ and $N_0' = N_0 + \beta - 1$
- Derivative: $\frac{d\ell^{MAP}(\Phi)}{d\Phi} = \frac{N_1'}{\Phi} - \frac{N_0'}{1-\Phi}$
- Set to zero and solve: $\Phi^{MAP} = \frac{N_1'}{N_1'+N_0'} = \frac{N_1+\alpha-1}{N_1+\alpha-1+N_0+\beta-1}$
- Ex #1: Suppose D = {8 Heads, 2 Tails}
  - $\Phi^{MLE} = 8/10 = 0.8$
  - Now if $\Phi \sim \text{Beta}(\alpha=101, \beta=101)$:
    - $\Phi^{MAP} = (8+101-1)\,/\,(8+101-1+2+101-1) = 108/(108+102) \approx 0.51$
    - The prior acts like pseudo-counts
  - Now if $\Phi \sim \text{Beta}(\alpha=101, \beta=1)$:
    - $\Phi^{MAP} = 108/(108+2) \approx 0.98$
- Ex #2: D = {108 Heads, 102 Tails}
  - $\Phi^{MLE} = 108/(108+102) \approx 0.51$, the same value as the MAP estimate in Ex #1 under the Beta(101, 101) prior (a numerical check of both examples appears below)
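A short check of the MAP formula and the two coin examples, using a hypothetical helper `phi_map` written just for this sketch:

```python
# MAP estimate for Bernoulli with a Beta(alpha, beta) prior:
# phi^MAP = (N1 + alpha - 1) / (N1 + alpha - 1 + N0 + beta - 1)
def phi_map(N1, N0, alpha, beta):
    return (N1 + alpha - 1) / (N1 + alpha - 1 + N0 + beta - 1)

# Ex #1: D = {8 Heads, 2 Tails}
print(8 / 10)                          # phi^MLE = 0.8
print(phi_map(8, 2, 101, 101))         # strong symmetric prior pulls it to ~0.51
print(phi_map(8, 2, 101, 1))           # one-sided prior pushes it to ~0.98

# Ex #2: D = {108 Heads, 102 Tails} -- the MLE equals Ex #1's MAP under Beta(101, 101),
# which is the sense in which the prior acts like pseudo-counts
print(108 / 210)
```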
- Data: $y \in \{0,1\}$, $x \in \{0,1\}^M$
- Model (Bernoulli Naive Bayes; here $H$ and $T$ denote the two values of $y$):
  - $y \sim \text{Bernoulli}(\Phi)$, giving $p(y|\Phi)$
  - $x_m \sim \text{Bernoulli}(\theta_{y,m})$ for $m = 1, \ldots, M$, giving $p(x_m|y,\theta)$
  - $\Phi \in [0,1]$, $\theta = [\theta_{H,1}\ \theta_{H,2}\ \cdots\ \theta_{H,M}\ \theta_{T,1}\ \theta_{T,2}\ \cdots\ \theta_{T,M}]$
$$p(x_1, x_2, \cdots, x_M, y|\Phi,\theta) = p(y|\Phi)\, p(x_1|y,\theta)\, p(x_2|y,\theta) \cdots p(x_M|y,\theta) = p(y|\Phi) \prod_{m=1}^M p(x_m|y, \theta_{H,m}, \theta_{T,m}) = \Phi^y(1-\Phi)^{1-y} \prod_{m=1}^M \theta_{y,m}^{x_m} (1-\theta_{y,m})^{1-x_m}$$
- Def: two random variables $X$ and $Y$ are conditionally independent given a random variable $Z$, written $X \perp\!\!\!\perp Y \mid Z$, if and only if $p(x,y|z) = p(x|z)\,p(y|z)$
- Naive Bayes Assumption: $p(x|y) = \prod_{m=1}^M p(x_m|y)$, that is, $x_q$ and $x_r$ are conditionally independent given $y$ (even though in reality the features might not be independent)
- Log-likelihood
$$\ell(\Phi,\theta) = \log \prod_{i=1}^N p(x^{(i)},y^{(i)}|\Phi,\theta) = \sum_{i=1}^N \Big[\log p(y^{(i)}|\Phi) + \sum_{m=1}^M \log p(x_m^{(i)}|y^{(i)},\theta)\Big] = N_{y=H}\log\Phi + N_{y=T}\log(1-\Phi) + \sum_{m=1}^M \Big[N_{x_m=1,y=H}\log\theta_{H,m} + N_{x_m=0,y=H}\log(1-\theta_{H,m})\Big] + \sum_{m=1}^M \Big[N_{x_m=1,y=T}\log\theta_{T,m} + N_{x_m=0,y=T}\log(1-\theta_{T,m})\Big]$$
- Case a) $\Phi$
  - Take the partial derivative wrt $\Phi$: $\frac{\partial\ell(\Phi,\theta)}{\partial\Phi} = \frac{N_{y=1}}{\Phi} - \frac{N_{y=0}}{1-\Phi}$
  - Set to zero and solve for $\Phi$: $\Phi^{MLE} = \frac{N_{y=1}}{N_{y=1}+N_{y=0}} = \frac{N_{y=1}}{N}$
- Case b) $\theta$ elements
  - Take the partial derivative wrt $\theta_{H,m}$ (the case for $y=H$ and feature $m$): $\frac{\partial\ell(\Phi,\theta)}{\partial\theta_{H,m}} = \frac{N_{x_m=1,y=H}}{\theta_{H,m}} - \frac{N_{x_m=0,y=H}}{1-\theta_{H,m}}$
  - Set to zero and solve for $\theta_{H,m}$: $\theta_{H,m}^{MLE} = \frac{N_{x_m=1,y=H}}{N_{x_m=1,y=H}+N_{x_m=0,y=H}} = \frac{N_{x_m=1,y=H}}{N_{y=H}}$
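A minimal sketch of these closed-form Naive Bayes MLEs computed from counts. The tiny dataset is made up, and the classes are written as 0/1 (matching the data section) rather than T/H.

```python
import numpy as np

# X is N x M binary features, y is N binary labels; purely illustrative data
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0])

N = len(y)
phi_mle = y.mean()                                   # phi^MLE = N_{y=1} / N

# theta_{c,m}^MLE = N_{x_m=1, y=c} / N_{y=c} for each class c and feature m
theta_mle = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
print(phi_mle)       # 0.6
print(theta_mle)     # row 0: estimates for class y=0, row 1: for class y=1
```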