Updated Lec 11
conan-sw committed Feb 27, 2025
1 parent 4ace138 commit e420602
Showing 59 changed files with 2,465 additions and 1,570 deletions.
784 changes: 784 additions & 0 deletions constant_model_loss_transformations/loss_transformations.ipynb

Large diffs are not rendered by default.

51 changes: 26 additions & 25 deletions constant_model_loss_transformations/loss_transformations.qmd
@@ -1,13 +1,13 @@
---
title: 'Constant Model, Loss, and Transformations (Old Notes from Fall 2024)'
title: Constant Model, Loss, and Transformations
execute:
echo: true
format:
html:
code-fold: true
code-tools: true
toc: true
toc-title: 'Constant Model, Loss, and Transformations'
toc-title: Constant Model, Loss, and Transformations
page-layout: full
theme:
- cosmo
@@ -21,7 +21,7 @@ jupyter:
format_version: '1.0'
jupytext_version: 1.16.1
kernelspec:
display_name: ds100env
display_name: data100quarto
language: python
name: python3
---
@@ -53,14 +53,6 @@ At the end of last lecture, we dived deeper into step 4 - evaluating model perfo

Before we get into the modeling process, let's quickly review some important terminology.

### Prediction vs. Estimation

The terms prediction and estimation are often used somewhat interchangeably, but there is a subtle difference between them. **Estimation** is the task of using data to calculate model parameters. **Prediction** is the task of using a model to predict outputs for unseen data. In our simple linear regression model,

$$\hat{y} = \hat{\theta_0} + \hat{\theta_1}x$$

we **estimate** the parameters by minimizing average loss; then, we **predict** using these estimates. **Least Squares Estimation** refers to choosing the parameters that minimize MSE.
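
As a quick illustration of this distinction, here is a minimal sketch in code. The toy data and variable names below are illustrative only, not from the lecture:

```python
import numpy as np

# Toy dataset (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Estimation: choose the parameters that minimize average squared loss
# (the least squares estimates for simple linear regression)
theta1_hat = np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)
theta0_hat = np.mean(y) - theta1_hat * np.mean(x)

# Prediction: use the fitted model to produce outputs for unseen inputs
x_new = np.array([6.0, 7.0])
y_hat = theta0_hat + theta1_hat * x_new
print(y_hat)
```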

## Constant Model + MSE

Now, we'll shift from the SLR model to the **constant model**, also known as a summary statistic. The constant model is slightly different from the simple linear regression model we've explored previously. Rather than generating predictions from an inputted feature variable, the constant model always *predicts the same constant number*. This ignores any relationships between variables. For example, let's say we want to predict the number of drinks a boba shop sells in a day. Boba tea sales likely depend on the time of year, the weather, how the customers feel, whether school is in session, etc., but the constant model ignores these factors in favor of a simpler model. In other words, the constant model employs a **simplifying assumption**.
@@ -95,14 +87,12 @@ We can **fit the model** by finding the optimal $\hat{\theta_0}$ that minimizes

1. Differentiate with respect to $\theta_0$:

$$
\begin{align}
\frac{d}{d\theta_0}\text{R}(\theta) & = \frac{d}{d\theta_0}(\frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2)
\\ &= \frac{1}{n}\sum^{n}_{i=1} \frac{d}{d\theta_0} (y_i - \theta_0)^2 \quad \quad \text{the derivative of a sum is the sum of the derivatives}
\\ &= \frac{1}{n}\sum^{n}_{i=1} 2 (y_i - \theta_0) (-1) \quad \quad \text{chain rule}
\\ &= {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \theta_0) \quad \quad \text{simplify constants}
\end{align}
$$

2. Set the derivative equation equal to 0:

@@ -112,7 +102,6 @@

3. Solve for $\hat{\theta_0}$

$$
\begin{align}
0 &= {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \hat{\theta_0})
\\ &= \sum^{n}_{i=1} (y_i - \hat{\theta_0}) \quad \quad \text{divide both sides by } \frac{-2}{n}
@@ -122,7 +111,23 @@
\\ \hat{\theta_0} &= \frac{1}{n} \sum^{n}_{i=1} y_i
\\ \hat{\theta_0} &= \bar{y}
\end{align}
$$

::: {.callout-note}
The **mean** of the outcomes achieves the minimum MSE of the constant model. The minimum MSE equals the **sample variance** of $y$:

\begin{align}
R(\hat{\theta_0}) & = R(\bar{y}) \\
& = \frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2\\
&= \sigma_y^2
\end{align}

$R^2$ is interpreted as the fraction of **variance in $y$** "explained" by a linear model.

\begin{align}
R^2 &= \frac{\Delta \text{ in area}}{\text{Constant model area}}\\\\
&= \frac{\text{MSE}_\text{constant model} - \text{MSE}_\text{linear model}}{\text{MSE}_\text{constant model}}
\end{align}
:::
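
As a hedged sketch of how this ratio can be computed, the snippet below uses toy observations and placeholder linear-model predictions (both made up for illustration):

```python
import numpy as np

y = np.array([20.0, 21.0, 22.0, 29.0, 33.0])             # toy observations
y_hat_linear = np.array([21.0, 22.0, 23.0, 28.0, 31.0])  # placeholder predictions from some fitted linear model

mse_constant = np.mean((y - np.mean(y)) ** 2)  # constant model always predicts ybar
mse_linear = np.mean((y - y_hat_linear) ** 2)

r_squared = (mse_constant - mse_linear) / mse_constant
print(r_squared)
```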

Let's take a moment to interpret this result. $\hat{\theta_0} = \bar{y}$ is the optimal parameter for constant model + MSE.
It holds true regardless of what data sample you have, and it provides some formal reasoning as to why the mean is such a common summary statistic.
@@ -133,11 +138,11 @@

To restate the above in plain English: we are looking at the value of the cost function when it takes the best parameter as input. This optimal model parameter, $\hat{\theta_0}$, is the value of $\theta_0$ that minimizes the cost $R$.

For modeling purposes, we care less about the minimum value of cost, $R(\hat{\theta_0})$, and more about the *value of $\theta$* that results in this lowest average loss. In other words, we concern ourselves with finding the best parameter value such that:
For modeling purposes, we care less about the minimum value of cost, $R(\hat{\theta_0})$, and more about the *value of $\theta_0$* that results in this lowest average loss. In other words, we concern ourselves with finding the best parameter value such that:

$$\hat{\theta} = \underset{\theta}{\operatorname{\arg\min}}\:R(\theta)$$
$$\hat{\theta_0} = \underset{\theta_0}{\operatorname{\arg\min}}\:R(\theta_0)$$

That is, we want to find the **arg**ument $\theta$ that **min**imizes the cost function.
That is, we want to find the **arg**ument $\theta_0$ that **min**imizes the cost function.
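
To make the $\arg\min$ idea concrete, here is a small sketch (toy data, illustrative only) that sweeps candidate values of $\theta_0$ and checks that the minimizer is approximately $\bar{y}$ and the minimum MSE is approximately $\sigma_y^2$:

```python
import numpy as np

y = np.array([20.0, 21.0, 22.0, 29.0, 33.0])   # toy observations

thetas = np.linspace(y.min(), y.max(), 1001)   # candidate constant predictions
mses = np.array([np.mean((y - t) ** 2) for t in thetas])

best_theta = thetas[np.argmin(mses)]           # argmin over the grid
print(best_theta, np.mean(y))                  # approximately the sample mean
print(mses.min(), np.var(y))                   # approximately the sample variance of y
```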

### Comparing Two Different Models, Both Fit with MSE

@@ -343,13 +348,11 @@ To fit the model, we find the optimal parameter value $\hat{\theta_0}$ that mini

1. Differentiate with respect to $\hat{\theta_0}$:

$$
\begin{align}
R(\theta_0) &= \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| \\
\frac{d}{d\theta_0} R(\theta_0) &= \frac{d}{d\theta_0} \left(\frac{1}{n} \sum^{n}_{i=1} |y_i - \theta_0| \right) \\
&= \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0|
\end{align}
$$

- Here, we seem to have run into a problem: the derivative of an absolute value is undefined when the argument is 0 (i.e. when $y_i = \theta_0$). For now, we'll ignore this issue. It turns out that disregarding this case doesn't influence our final result.
- To compute the derivative, consider two cases. When $\theta_0$ is *less than or equal to* $y_i$, the term $y_i - \theta_0$ will be positive, and the absolute value has no impact. When $\theta_0$ is *greater than* $y_i$, the term $y_i - \theta_0$ will be negative, and applying the absolute value flips its sign: $|y_i - \theta_0| = -(y_i - \theta_0) = \theta_0 - y_i$.
@@ -467,7 +470,7 @@ To summarize our example,
| Outliers | **Sensitive** to outliers (since they change mean substantially). Sensitivity also depends on the dataset size. | **More robust** to outliers. |
| $\hat{\theta_0}$ Uniqueness | **Unique** $\hat{\theta_0}$ | **Infinitely many** $\hat{\theta_0}$s |
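
The outlier row of this table is easy to see numerically. In the sketch below (toy numbers, not from the lecture), adding one extreme value moves the mean (the MSE minimizer) far more than the median (an MAE minimizer):

```python
import numpy as np

y = np.array([20.0, 21.0, 22.0, 29.0, 33.0])
y_with_outlier = np.append(y, 1000.0)   # add a single extreme observation

print(np.mean(y), np.mean(y_with_outlier))      # mean shifts substantially
print(np.median(y), np.median(y_with_outlier))  # median barely moves
```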

## Transformations to Fit Linear Models
## Transformations of Linear Models

At this point, we have an effective method of fitting models to predict linear relationships. Given a feature variable and target, we can apply our four-step process to find the optimal model parameters.

@@ -584,23 +587,22 @@ Earlier, we calculated the constant model MSE using calculus. It turns out that
In this calculation, we use the fact that the **sum of deviations from the mean is $0$** or that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$.

Let's quickly walk through the proof for this:
$$

\begin{align}
\sum_{i=1}^{n} (y_i - \bar{y}) &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y} \\
&= \sum_{i=1}^{n} y_i - n\bar{y} \\
&= \sum_{i=1}^{n} y_i - n\frac{1}{n}\sum_{i=1}^{n}y_i \\
&= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}y_i \\
& = 0
\end{align}
$$


In our calculations, we'll also be using the definition of the sample variance. As a refresher:

$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Getting into our calculation for MSE minimization:

$$
\begin{align}
R(\theta) &= {\frac{1}{n}}\sum^{n}_{i=1} (y_i - \theta)^2
\\ &= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y}) + (\bar{y} - \theta)]^2\quad \quad \text{using the trick that } a-b \text{ can be written as } (a-c) + (c-b) \\
Expand All @@ -610,7 +612,6 @@ R(\theta) &= {\frac{1}{n}}\sum^{n}_{i=1} (y_i - \theta)^2
\\ &= \frac{1}{n}\sum^{n}_{i=1}(y_i - \bar{y})^2 + \frac{2}{n}(\bar{y} - \theta)\cdot0 + (\bar{y} - \theta)^2 \quad \quad \text{sum of deviations from mean is 0}
\\ &= \sigma_y^2 + (\bar{y} - \theta)^2
\end{align}
$$

Since variance can't be negative, we know that our first term, $\sigma_y^2$, is greater than or equal to $0$. Also note that **the first term doesn't involve $\theta$ at all**, meaning changing our model won't change this value. For the purposes of determining $\hat{\theta}$, we can then essentially ignore this term.
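
As a quick numerical check of this decomposition (toy data, illustrative only): for any constant $\theta$, the constant model's MSE equals $\sigma_y^2 + (\bar{y} - \theta)^2$:

```python
import numpy as np

y = np.array([20.0, 21.0, 22.0, 29.0, 33.0])   # toy observations
theta = 23.0                                    # any candidate constant prediction

lhs = np.mean((y - theta) ** 2)                 # R(theta)
rhs = np.var(y) + (np.mean(y) - theta) ** 2     # sigma_y^2 + (ybar - theta)^2
print(lhs, rhs, np.isclose(lhs, rhs))           # the two agree
```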

