|  | Realizable | Agnostic |
| --- | --- | --- |
| Finite $\vert H \vert$ | Corollary 1 | Corollary 2 |
| Infinite $\vert H \vert$ | Corollary 3 | Corollary 4 |
- $H[S]$ – the set of splittings of dataset $S$ using concepts (hypotheses) from $H$
- $H$ shatters $S$ if $|H[S]| = 2^{|S|}$
- $|H[S]|$ = the number of splittings of $S$ by $H$, which is always $\leq 2^{|S|}$
- A set of points $S$ is shattered by $H$ if there are hypotheses in $H$ that split $S$ in all of the $2^{|S|}$ possible ways, i.e., all possible ways of classifying points in $S$ are achievable using concepts in $H$.
- The VC-dimension of a hypothesis space $H$ is the cardinality of (size of) the largest set $S$ that can be shattered by $H$
- If arbitrarily large finite sets can be shattered by $H$, then $VCdim(H) = \infty$
- To show that the VC-dimension is $d$:
    - there exists a set of $d$ points that can be shattered
    - there is no set of $d+1$ points that can be shattered
- Fact: If $H$ is finite, then $VCdim(H) \leq \log_2(|H|)$
- VCdim vs Shattering
- Proving that the VC-dimension is $d$ requires us to show that there exists (∃) a dataset of size $d$ that can be shattered and that there does not exist (∄) a dataset of size $d+1$ that can be shattered
- Proving that a particular dataset can be shattered requires us to show that for all (∀) labelings of the dataset, our hypothesis class contains a hypothesis that classifies it correctly (a shattering-check sketch follows below)
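As a concrete illustration of the ∃/∀ structure above, here is a minimal sketch (the 1-D threshold hypothesis class and all names are my own choices, not from the notes) that enumerates the labelings a hypothesis class realizes on a small point set and checks whether the set is shattered:

```python
def threshold_hypotheses(points):
    """1-D thresholds h_t(x) = +1 if x >= t else -1, restricted to the
    finitely many thresholds that give distinct behaviors on `points`."""
    xs = sorted(points)
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return [lambda x, t=t: 1 if x >= t else -1 for t in candidates]

def shatters(points):
    """True iff thresholds realize all 2^|S| labelings of `points`, i.e. |H[S]| = 2^|S|."""
    realized = {tuple(h(x) for x in points) for h in threshold_hypotheses(points)}
    return len(realized) == 2 ** len(points)

print(shatters([0.0]))       # True: both labelings of a single point are achievable
print(shatters([0.0, 1.0]))  # False: (+1, -1) is unreachable, so VCdim(thresholds) = 1
```

Since thresholds realize both labelings of any single point but cannot produce the labeling $(+1, -1)$ on two ordered points, this reproduces $VCdim = 1$ for 1-D thresholds.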
- Corollary 1 (Realizable, Finite $|H|$)
    - For any $\delta > 0$, with probability at least $(1 - \delta)$, for any $h \in H$ consistent with the training data (i.e. $\hat{R}(h)=0$): $R(h) \leq \frac{1}{N}\left[\ln(|H|) + \ln\left(\frac{1}{\delta}\right)\right]$
- Corollary 2 (Agnostic, Finite $|H|$)
    - For any $\delta > 0$, with probability at least $(1 - \delta)$, for all hypotheses $h \in H$: $R(h) \leq \hat{R}(h) + \sqrt{\frac{1}{2N}\left[\ln(|H|) + \ln\left(\frac{2}{\delta}\right)\right]}$
- Corollary 3 (Realizable, Infinite $|H|$)
    - For any $\delta > 0$, with probability at least $(1 - \delta)$, for any $h \in H$ consistent with the training data (i.e. $\hat{R}(h)=0$): $R(h) \leq O\left(\frac{1}{N}\left[VC(H)\ln\left(\frac{N}{VC(H)}\right) + \ln\left(\frac{1}{\delta}\right)\right]\right)$
- Corollary 4 (Agnostic, Infinite $|H|$)
    - For any $\delta > 0$, with probability at least $(1 - \delta)$, for all hypotheses $h \in H$: $R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{1}{N}\left[VC(H) + \ln\left(\frac{1}{\delta}\right)\right]}\right)$
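To make the finite-$|H|$ corollaries concrete, the sketch below (hypothetical helper functions, assuming the bounds exactly as stated in Corollaries 1 and 2) plugs in a sample size $N$, hypothesis-class size $|H|$, and confidence parameter $\delta$:

```python
import math

def realizable_bound(N, H_size, delta):
    """Corollary 1: R(h) <= (ln|H| + ln(1/delta)) / N for any consistent h."""
    return (math.log(H_size) + math.log(1.0 / delta)) / N

def agnostic_bound(N, H_size, delta, train_error):
    """Corollary 2: R(h) <= R_hat(h) + sqrt((ln|H| + ln(2/delta)) / (2N))."""
    return train_error + math.sqrt((math.log(H_size) + math.log(2.0 / delta)) / (2.0 * N))

# Example: N = 10,000 samples, |H| = 2^20 hypotheses, 95% confidence (delta = 0.05).
print(realizable_bound(10_000, 2**20, 0.05))                  # ~0.0017
print(agnostic_bound(10_000, 2**20, 0.05, train_error=0.03))  # ~0.03 + 0.03
```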
- Key idea: trade off between low training error and keeping $H$ simple (low VC-dimension)
- Ex: linear separators in $\mathbb{R}^M$: $VC(H) = M + 1$
- How to trade off?
- Use a regularizer $r(\theta) = \sum_{m=1}^M |\theta_m|$
- $\hat{\theta} = \text{argmin}_\theta \; J(\theta) + \lambda r(\theta)$, where the weight $\lambda$ controls the tradeoff (a sketch follows below)
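A minimal sketch of this regularized objective, assuming a squared-error $J(\theta)$ and a hypothetical weight `lam` on the regularizer (the notes leave both unspecified):

```python
import numpy as np

def l1_regularizer(theta):
    """r(theta) = sum_{m=1}^M |theta_m|, which penalizes large/dense weight vectors."""
    return np.sum(np.abs(theta))

def regularized_objective(theta, X, y, lam=0.1):
    """J(theta) + lam * r(theta); J is mean squared error here (an assumption),
    and lam trades off low training error against keeping the hypothesis simple."""
    J = 0.5 * np.mean((X @ theta - y) ** 2)
    return J + lam * l1_regularizer(theta)
```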
- Given data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$
- Choose a decision function $h_\theta(x)$ parameterized by $\theta$
- Choose an objective function $J_D(\theta)$ that relies on the data
- Learn by choosing parameters that optimize the objective: $\hat{\theta} = \text{argmin}_\theta J_D(\theta)$
- Predict on a new test example $x_{new}$ using $h$: $\hat{y} = h_{\hat{\theta}}(x_{new})$
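The recipe above, written out as a runnable sketch for one concrete instantiation (linear regression with $h_\theta(x) = \theta^T x$, a squared-error objective, and gradient descent; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Given data D = {(x^(i), y^(i))}_{i=1}^N
N, M = 100, 3
X = rng.normal(size=(N, M))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Decision function h_theta(x) = theta^T x
def h(theta, x):
    return x @ theta

# Objective J_D(theta): mean squared error on the data
def J(theta):
    return 0.5 * np.mean((h(theta, X) - y) ** 2)

# Learn theta = argmin J_D(theta) by gradient descent
theta = np.zeros(M)
gamma = 0.1
for _ in range(500):
    grad = X.T @ (h(theta, X) - y) / N
    theta -= gamma * grad

# Predict on a new test example x_new
x_new = rng.normal(size=M)
y_hat = h(theta, x_new)
print(theta, y_hat)
```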
- Perceptron: $h_\theta(x) = \text{sign}(\theta^T x)$
- Linear Regression: $h_\theta(x) = \theta^T x$
- Discriminative Models: $h_\theta(x) = \text{argmax}_y \; p_\theta(y|x)$
    - Logistic Regression: $p(y=1|x) = \sigma(\theta^T x)$
- Generative Models: $h_\theta(x) = \text{argmax}_y \; p_\theta(x, y)$
    - Naive Bayes: $p(x,y) = p(y) \prod_{m=1}^M p(x_m|y)$
- Neural Network for classification: $p(y=1|x) = \sigma\left((W^{(2)})^T \sigma\left((W^{(1)})^T x + b^{(1)}\right) + b^{(2)}\right)$
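A short sketch of three of the decision functions above (perceptron, logistic regression, and the one-hidden-layer network for classification), with arbitrary placeholder weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_predict(theta, x):
    """h_theta(x) = sign(theta^T x)."""
    return np.sign(theta @ x)

def logistic_predict_proba(theta, x):
    """p(y=1 | x) = sigma(theta^T x)."""
    return sigmoid(theta @ x)

def nn_predict_proba(W1, b1, W2, b2, x):
    """p(y=1 | x) = sigma(W2^T sigma(W1^T x + b1) + b2) for one hidden layer."""
    hidden = sigmoid(W1.T @ x + b1)
    return sigmoid(W2 @ hidden + b2)

# Example with M = 4 inputs and 3 hidden units (placeholder values).
x = np.array([1.0, -2.0, 0.5, 3.0])
theta = np.array([0.2, -0.1, 0.4, 0.05])
W1, b1 = 0.1 * np.ones((4, 3)), np.zeros(3)
W2, b2 = 0.3 * np.ones(3), 0.0
print(perceptron_predict(theta, x), logistic_predict_proba(theta, x), nn_predict_proba(W1, b1, W2, b2, x))
```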
- Maximum Likelihood Estimation (MLE): $J(\theta) = -\sum_{i=1}^N \log p_\theta(x^{(i)}, y^{(i)})$
- Maximum Conditional Likelihood Estimation (MCLE): $J(\theta) = -\sum_{i=1}^N \log p_\theta(y^{(i)} | x^{(i)})$
- L2 Regularizer: $J'(\theta) = J(\theta) + \lambda \|\theta\|_2^2$
- L1 Regularizer: $J'(\theta) = J(\theta) + \lambda \|\theta\|_1$
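As an example of these objectives, here is a sketch of the MCLE objective for logistic regression with an optional L2 penalty (an assumed pairing; the notes do not fix a particular model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_cond_log_likelihood(theta, X, y):
    """MCLE objective J(theta) = -sum_i log p_theta(y^(i) | x^(i)) for logistic regression,
    with labels y in {0, 1}."""
    p = sigmoid(X @ theta)        # p(y=1 | x^(i)) for each example
    eps = 1e-12                   # numerical guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def l2_regularized_objective(theta, X, y, lam=0.1):
    """J'(theta) = J(theta) + lam * ||theta||_2^2."""
    return neg_cond_log_likelihood(theta, X, y) + lam * np.sum(theta ** 2)
```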
- Gradient Descent
- SGD, where $J(\theta) = \frac{1}{N} \sum_{i=1}^N J^{(i)}(\theta)$
    - $\theta \leftarrow \theta - \gamma \nabla_\theta J^{(i)}(\theta)$ for $i \sim \text{Uniform}(\{1, \ldots, N\})$
- Mini-batch SGD
- Closed Form
- Compute the partial derivatives
- Set them to zero and solve
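A sketch of the optimization options above for linear regression with squared error (an assumed model, since the notes list the optimizers generically): an SGD loop and the closed-form solution obtained by setting the partial derivatives to zero (the normal equations).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 5
X = rng.normal(size=(N, M))
y = X @ rng.normal(size=M) + 0.05 * rng.normal(size=N)

# SGD: theta <- theta - gamma * grad J^(i)(theta) for i ~ Uniform({1, ..., N})
theta = np.zeros(M)
gamma = 0.01
for _ in range(5000):
    i = rng.integers(N)                        # sample a single training example
    grad_i = (X[i] @ theta - y[i]) * X[i]      # gradient of the per-example squared error
    theta -= gamma * grad_i

# Closed form: set the gradient of the full objective to zero and solve (normal equations)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(np.max(np.abs(theta - theta_closed)))    # the SGD iterate should be close to the closed form
```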