To explain quarterly sales via other explanatory variables
Example | Details
---|---
Simple Linear Regression | To explain sales via the budget of youtube and facebook advertising, respectively (one predictor at a time)
Multiple Linear Regression | The comprehensive results of my R code that runs a multiple linear regression of sales on the budgets of three advertising media (youtube, facebook, and newspaper)
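A minimal Python sketch of the two examples above, assuming a data frame with columns youtube, facebook, newspaper, and sales as in the marketing example (the file name marketing.csv is hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file; any data frame with these columns works
marketing = pd.read_csv("marketing.csv")

# Simple linear regression: sales on one advertising budget at a time
fit_youtube = smf.ols("sales ~ youtube", data=marketing).fit()
print(fit_youtube.summary())

# Multiple linear regression: sales on all three advertising budgets
fit_all = smf.ols("sales ~ youtube + facebook + newspaper", data=marketing).fit()
print(fit_all.summary())   # comparable to R's summary(lm(...)) output
```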
Model | Hypothesis, hθ(x) | Notes
---|---|---
Simple Linear Regression | hθ(x) = θᵀx = θ₀x₀ + θ₁x₁, where conventionally x₀ = 1 | Assumes linearity in the relationship between x and y, and i.i.d. residuals
Multiple Linear Regression | hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ, where conventionally x₀ = 1 | All assumptions of simple linear regression, plus little or no multicollinearity among the X's
Polynomial Regression | hθ(x) = θ₀ + θ₁x + θ₂x² + ⋯ + θₙxⁿ | Still considered linear regression since it is linear in the regression coefficients, although the fitted relationship between x and y is non-linear
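A short sketch of the polynomial-regression note above: the design matrix contains powers of x, but the model is still fit by ordinary least squares because it is linear in the coefficients θ (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=(100, 1))

# Build the columns [x, x^2]; the intercept plays the role of theta_0
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y.ravel())
print(model.intercept_, model.coef_)   # approximately theta_0, theta_1, theta_2
```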
Method | Details
---|---
To minimize the cost function (e.g., by gradient descent) | Least-squares cost J(θ) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)²; gradient descent repeats θⱼ := θⱼ − α ∂J(θ)/∂θⱼ until convergence
To solve analytically using the normal equation | θ = (XᵀX)⁻¹Xᵀy <br> Implementation: <br> - Clojure (incanter): `(mmult (mmult (solve (mmult (trans X) X)) (trans X)) y)` <br> - Python (NumPy, with `from numpy.linalg import inv`): `inv(X.T.dot(X)).dot(X.T).dot(y)` <br> - R: `solve( t(X) %*% X ) %*% t(X) %*% y`
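A runnable NumPy sketch of both methods in the table, on synthetic data (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # x0 = 1 prepended by convention
theta_true = np.array([3.0, 1.5, -2.0])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

# (1) Analytical solution via the normal equation: theta = (X'X)^{-1} X'y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# (2) Batch gradient descent on the least-squares cost J(theta)
theta_gd = np.zeros(3)
m, lr = len(y), 0.1
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y) / m   # gradient of J(theta)
    theta_gd -= lr * grad

print(theta_ne, theta_gd)   # both should be close to theta_true
```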
Coefficient | Interpretation
---|---
Unstandardized | The amount by which the dependent variable changes when the independent variable changes by one unit, holding the other independent variables constant.
Standardized | Measured in units of standard deviation (of both X and Y). A beta of 1.25 indicates that a one-standard-deviation increase in the independent variable is associated with a 1.25-standard-deviation increase in the dependent variable.
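A sketch of how the two kinds of coefficients relate, on synthetic data: z-scoring both X and y before fitting yields the standardized coefficients, which equal the unstandardized ones rescaled by sd(xⱼ)/sd(y).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=200)

X, y = df[["x1", "x2"]], df["y"]
unstd = LinearRegression().fit(X, y).coef_            # change in y per one-unit change in x_j

Xz = (X - X.mean()) / X.std(ddof=0)                   # z-score predictors
yz = (y - y.mean()) / y.std(ddof=0)                   # z-score response
std = LinearRegression().fit(Xz, yz).coef_            # change in y (in SDs) per one-SD change in x_j

print(unstd)
print(std)
print(unstd * X.std(ddof=0) / y.std(ddof=0))          # same values as `std`
```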
Model | Penalty | Description
---|---|---
Ridge regression <br> (to handle multicollinearity) <br> Named "ridge" because the λIₚ term adds a "ridge" along the diagonal of XᵀX | L2 regularization: add the squared magnitude of the coefficients (squared L2 norm) as a penalty term to the cost function, used when there is multicollinearity among the X's (which leads to overfitting of the data). <br> - R's `glmnet(alpha=0)` minimizes the elastic-net objective with the pure L2 penalty <br> - sklearn's `Ridge()` minimizes `||y - Xw||^2_2 + alpha * ||w||^2_2` <br> - statsmodels' `OLS.fit_regularized(L1_wt=0.0)` minimizes `0.5*RSS/n + 0.5*alpha*|params|^2_2` | - Multicollinearity may give rise to large variances of the coefficient estimates, which can be reduced by ridge regression. <br> - Recall the least-squares estimator θ = (XᵀX)⁻¹Xᵀy; it fails when (XᵀX)⁻¹ does not exist, i.e. XᵀX is singular. <br> - Such non-invertibility is due to (a) multicollinearity, or (b) more predictors than observations. <br> - The ridge fix is θ = (XᵀX + λI)⁻¹Xᵀy. <br> - Note, we do not penalize the intercept term.
Lasso regression <br> (to perform feature selection, or to simplify the model) | L1 regularization: add the magnitude of the coefficients (L1 norm) as a penalty term to the cost function, used when the model is too complex and has trivial predictors (which leads to overfitting of the data). <br> - R's `glmnet(alpha=1)` minimizes the elastic-net objective with the pure L1 penalty <br> - sklearn's `Lasso()` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1` <br> - statsmodels' `OLS.fit_regularized(L1_wt=1.0)` minimizes `0.5*RSS/n + alpha*|params|_1` | - Shrinks some of the coefficients to exactly 0. <br> - This leads to a sparse model (many coefficients equal to 0), helping interpretability. <br> - Keeps the more important predictors. <br> - Note, we do not penalize the intercept term. <br> - Note, the lasso objective is not differentiable at 0.
Elastic net | Lasso L1 penalty + ridge L2 penalty | ---
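A minimal sklearn sketch of the two penalties quoted in the table (synthetic data; the alpha values are illustrative, and because each library scales its objective differently the same alpha is not directly comparable across Ridge, Lasso, glmnet, or statsmodels):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)   # near-duplicate column -> multicollinearity
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)            # penalties assume comparable feature scales

ridge = Ridge(alpha=1.0).fit(Xs, y)               # L2 penalty shrinks correlated coefficients
lasso = Lasso(alpha=0.1).fit(Xs, y)               # L1 penalty drives some coefficients exactly to 0
print(ridge.coef_)
print(lasso.coef_)                                # note the exact zeros (sparse model)
# Both estimators fit an unpenalized intercept by default,
# consistent with the "do not penalize the intercept" note above.
```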
With each regularized regression, λ is the only tuning parameter. As λ increases, the coefficients tend to shrink, but the MSE tends to increase (see the graphs below); thus there is a sweet spot where the coefficients are shrunk and the MSE is also lowest. Because the implementations in R and Python are different, the coefficients across implementations may not be directly comparable; however, the MSE/RMSE/R² are comparable.
Regression | Coef vs. log(λ) | MSE vs. log(λ)
---|---|---
Ridge | (plot) | (plot)
Lasso | (plot) | (plot)
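A sketch of how the two plot columns above can be produced: sweep a log-spaced grid of alpha values (standing in for log λ), recording the fitted coefficients and the cross-validated MSE at each value (synthetic data; lasso shown, ridge is analogous):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=120)

alphas = np.logspace(-3, 1, 30)
coefs, mses = [], []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10000)
    coefs.append(model.fit(X, y).coef_)               # data for "coef vs. log lambda"
    mses.append(-cross_val_score(model, X, y,
                                 scoring="neg_mean_squared_error",
                                 cv=5).mean())        # data for "MSE vs. log lambda"

# coefs and mses would be plotted against np.log(alphas)
best = alphas[int(np.argmin(mses))]
print(best)   # the "sweet spot" where cross-validated MSE is lowest
```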
Model performance with the testing set:

 | Linear regression | Ridge regression | Lasso
---|---|---|---
RMSE | 418.3987 | 374.5406 | 380.2771
R² | 0.2209 | 0.3757 | 0.3564
Coefficients | | |
Compared to linear regression, both ridge and lasso regression appear to improve model performance on the test set (lower RMSE and higher R²).
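A sketch of how such a test-set comparison can be produced (synthetic data, so the numbers will not match the table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each model on the training set, evaluate RMSE and R2 on the held-out test set
for name, model in [("Linear", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    r2 = r2_score(y_te, pred)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```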