Linear Regression

To explain quarterly sales using other explanatory variables


Examples

Example | Details
--- | ---
Simple Linear Regression | To explain sales via the budget of youtube and facebook advertising, respectively
Multiple Linear Regression | Comprehensive results of my R code that runs a multiple linear regression of sales on the budgets of three advertising media (youtube, facebook and newspaper)
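
A minimal sketch of the two examples above (not the repo's own R script): the column names `youtube`, `facebook`, `newspaper`, `sales` and the synthetic data are assumptions made only so the snippet runs end to end.

```python
# Fit the simple and multiple regression examples with scikit-learn on a
# marketing-style data frame (synthetic data; column names are assumed).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "youtube": rng.uniform(0, 300, n),
    "facebook": rng.uniform(0, 50, n),
    "newspaper": rng.uniform(0, 100, n),
})
# Synthetic sales, just to make the sketch runnable
df["sales"] = 3 + 0.045 * df["youtube"] + 0.19 * df["facebook"] + rng.normal(0, 2, n)

# Simple linear regression: sales ~ youtube
simple = LinearRegression().fit(df[["youtube"]], df["sales"])
print("sales ~ youtube:", simple.intercept_, simple.coef_)

# Multiple linear regression: sales ~ youtube + facebook + newspaper
multiple = LinearRegression().fit(df[["youtube", "facebook", "newspaper"]], df["sales"])
print("sales ~ youtube + facebook + newspaper:", multiple.intercept_, multiple.coef_)
```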

Models

Model | Hypothesis, hθ(x) | Notes
--- | --- | ---
Simple Linear Regression | θᵀx = θ₀x₀ + θ₁x₁ | Conventionally, x₀ = 1. Assumes a linear relationship between x and y, and i.i.d. residuals.
Multiple Linear Regression | θᵀx = θ₀x₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ | Conventionally, x₀ = 1. All the assumptions of simple linear regression, plus little or no multicollinearity among the predictors.
Polynomial Regression | See the functions below | Conventionally, x₀ = 1. Still considered linear regression, since the model is linear in the regression coefficients even though the fitted function of x is non-linear.

Function | Math Expression, y ~ f(x, θ)
--- | ---
Quadratic | θ₀x₀ + θ₁x₁ + θ₂x₁²
A "circle" | θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²
Cubic | θ₀x₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³
Square root | θ₀x₀ + θ₁x₁ + θ₂x₁^0.5
Other higher-order polynomials | θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₁x₂, etc.
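
A minimal sketch of the point in the last note: polynomial regression is ordinary linear regression on expanded features, so the coefficients are still estimated linearly. The quadratic example and synthetic data below are illustrative assumptions.

```python
# Fit the "quadratic" row above: expand x into (x1, x1^2), then run plain
# linear regression, which stays linear in the coefficients theta.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(150, 1))
y = 2.0 + 1.5 * x[:, 0] - 0.8 * x[:, 0] ** 2 + rng.normal(0, 0.5, 150)  # synthetic

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # [x1, x1^2]
model = LinearRegression().fit(X_poly, y)
print("theta0 (intercept):", model.intercept_)
print("theta1, theta2:", model.coef_)  # should land near 1.5 and -0.8
```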

Estimation of coefficients/parameters

Method | Details
--- | ---
To minimize the cost function | Iteratively, e.g., by gradient descent on the least-squares cost J(θ)
To solve analytically using the normal equation | θ = (XᵀX)⁻¹ Xᵀ y

Implementation:
- Clojure (incanter): (mmult (mmult (solve (mmult (trans X) X)) (trans X)) y)
- Python: inv(X.T.dot(X)).dot(X.T).dot(y), with inv imported via from numpy.linalg import inv
- R: solve( t(X) %*% X ) %*% t(X) %*% y
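
A minimal check of the Python one-liner above: on a synthetic design matrix (with a leading column of ones for x₀ = 1), the normal-equation estimate should match numpy's least-squares solver. The data below are assumptions for illustration only.

```python
# Verify the normal equation theta = (X'X)^-1 X'y against numpy's lstsq.
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # x0 = 1, x1, x2
theta_true = np.array([4.0, 2.0, -1.0])
y = X @ theta_true + rng.normal(0, 0.1, n)

theta_normal_eq = inv(X.T.dot(X)).dot(X.T).dot(y)            # (X'X)^-1 X'y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal_eq)
print(np.allclose(theta_normal_eq, theta_lstsq))             # True
```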

Interpretation of coefficients

Coefficient | Interpretation
--- | ---
Unstandardized | The amount by which the dependent variable changes when the corresponding independent variable changes by one unit, holding the other independent variables constant.
Standardized | Measured in units of standard deviation (of both X and Y). A beta of 1.25 indicates that a one-standard-deviation increase in the independent variable results in a 1.25-standard-deviation increase in the dependent variable.
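
A minimal sketch of the two interpretations above, under one common convention (not necessarily the one used elsewhere in this repo): standardized coefficients are obtained by z-scoring every X column and y before refitting.

```python
# Unstandardized vs. standardized coefficients on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) * [10.0, 0.5]       # predictors on very different scales
y = 1.0 + 0.3 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(0, 1, 200)

unstd = LinearRegression().fit(X, y)
print("unstandardized:", unstd.coef_)             # change in y per one-unit change in x

Xz = (X - X.mean(axis=0)) / X.std(axis=0)         # z-score each predictor
yz = (y - y.mean()) / y.std()                     # z-score the response
std = LinearRegression().fit(Xz, yz)
print("standardized:  ", std.coef_)               # change in y (in SDs) per one-SD change in x
```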

Regularized linear regression

Ridge regression (to handle multicollinearity; details)

Penalty: the L2 regularization technique, i.e., adding the "squared magnitude" of the coefficients (the squared L2 norm, ||θ||^2_2 = Σⱼ θⱼ²) as a penalty term to the cost function. It is used when there is multicollinearity among the X's, which otherwise leads to overfitting of the data. It is named "ridge" because the matrix form of λIₚ looks like a ridge.

Library objective functions:
- R's glmnet(alpha=0) minimizes (in the glmnet formulation): RSS/(2n) + λ/2 * ||β||^2_2
- sklearn's Ridge() minimizes: ||y - Xw||^2_2 + alpha * ||w||^2_2
- statsmodels' OLS.fit_regularized(L1_wt=0.0) minimizes: 0.5*RSS/n + 0.5*alpha*|params|^2_2

Notes:
* Multicollinearity may give rise to large variances of the coefficient estimates, which can be reduced by ridge regression.
* Multicollinearity can also make (XᵀX)⁻¹ not exist, i.e., XᵀX is singular; recall the least-squares estimator θ̂ = (XᵀX)⁻¹ Xᵀ y. Such non-invertibility is due to (a) multicollinearity, or (b) # predictors > # observations.
* To fix, use the ridge estimator θ̂ = (XᵀX + λIₚ)⁻¹ Xᵀ y instead (see the sketch after this list).
* Note, we do not penalize the intercept term.
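
A minimal sketch of the ridge "fix" above: adding λI to XᵀX makes it invertible even when the columns are nearly collinear, and the closed form matches sklearn's Ridge when the (unpenalized) intercept is left out of the toy design matrix. The data here are assumptions for illustration.

```python
# Ridge closed form (X'X + lambda*I)^-1 X'y vs. sklearn's Ridge (no intercept).
import numpy as np
from numpy.linalg import inv
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(0, 1e-6, n)])   # nearly collinear columns
y = 3 * x1 + rng.normal(0, 0.5, n)

lam = 1.0
theta_ridge = inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(theta_ridge)
print(np.allclose(theta_ridge, sk.coef_))                # True
```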
Lasso regression (to perform feature selection, or to simplify the model; details)

Penalty: the L1 regularization technique, i.e., adding the "magnitude" of the coefficients (the L1 norm, ||θ||_1 = Σⱼ |θⱼ|) as a penalty term to the cost function. It is used when the model is too complex and has trivial predictors, which otherwise leads to overfitting of the data.

Library objective functions:
- R's glmnet(alpha=1) minimizes (in the glmnet formulation): RSS/(2n) + λ * ||β||_1
- sklearn's Lasso() minimizes: (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
- statsmodels' OLS.fit_regularized(L1_wt=1.0) minimizes: 0.5*RSS/n + alpha*|params|_1

Notes:
* The L1 penalty shrinks some of the coefficients to exactly 0, keeping the more important predictors.
* This leads to a sparse model (having many coefficients equal to 0), which helps interpretability (see the sketch after this list).
* Note, we do not penalize the intercept term.
* Note, the lasso objective is not differentiable (the L1 norm has no derivative at 0).
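
A minimal sketch of the sparsity note above, on assumed synthetic data with one irrelevant predictor: lasso tends to zero that coefficient out exactly, while ridge only shrinks it.

```python
# Lasso vs. ridge on data where the third predictor is irrelevant.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, 200)   # X[:, 2] is irrelevant

print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # third coefficient: typically exactly 0
print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # third coefficient: small, but not 0
```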
Elastic net

Penalty: a combination of the lasso L1 penalty and the ridge L2 penalty.
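
A minimal sketch of the elastic net row above, using sklearn's ElasticNet (one possible implementation, with assumed synthetic data): the l1_ratio parameter mixes the two penalties, 1.0 being pure L1 and 0.0 pure L2.

```python
# Elastic net: mix the L1 and L2 penalties via l1_ratio.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, 200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # half L1, half L2
print(enet.intercept_, enet.coef_)
```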

Example: Ridge Regression (glmnet alpha=0) and Lasso Regression (glmnet alpha=1)

For each regression, λ is the only tuning parameter. When λ increases, the coefficients tend to shrink but the MSE tends to increase (see the plots below), so there is a sweet spot where the coefficients are shrunk and the MSE is also at its lowest. Because the R and Python implementations minimize differently scaled objective functions, the coefficients from different implementations may not be directly comparable; the MSE/RMSE/R² values, however, are comparable.
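
A minimal sketch of finding that sweet spot via cross-validation over a grid of penalty strengths (sklearn calls λ "alpha"); the repo's own workflow is in R/Python and may use glmnet's cv.glmnet instead, so treat this as an illustrative alternative on assumed data.

```python
# Pick lambda by cross-validation with sklearn's RidgeCV and LassoCV.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, 200)

alphas = np.logspace(-3, 2, 50)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
print("best ridge lambda:", ridge_cv.alpha_)
print("best lasso lambda:", lasso_cv.alpha_)
```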

(Figures: coefficient paths vs. log λ and MSE vs. log λ, for ridge and for lasso.)

Model performance with the testing set:

Metric | Linear regression | Ridge regression | Lasso
--- | --- | --- | ---
RMSE | 418.3987 | 374.5406 | 380.2771
R² | 0.2209 | 0.3757 | 0.3564

(Figures: coefficient estimates for each model.)

Compared to linear regression, both ridge and lasso regression appear to have improved the model performance.
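
A minimal sketch of how test-set comparisons like the table above can be produced; the data here are synthetic assumptions, so the numbers will not match the table.

```python
# Compare linear, ridge, and lasso regression on a held-out test set.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(name, "RMSE:", round(rmse, 4), "R2:", round(r2_score(y_te, pred), 4))
```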

My own code: R and Python