To explain quarterly sales via other explanatory variables
Example | Details
---|---
Simple Linear Regression | To explain sales via the budget of youtube and facebook advertising, respectively (one predictor at a time)
Multiple Linear Regression | The comprehensive results of my R code that runs a multiple linear regression of sales on the budgets of three advertising media (youtube, facebook, and newspaper)
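A minimal Python sketch of the two examples above, assuming a data frame with columns youtube, facebook, newspaper, and sales as in the marketing example (the file name marketing.csv is hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file; any data frame with these columns works
marketing = pd.read_csv("marketing.csv")

# Simple linear regression: sales on one advertising budget at a time
fit_youtube = smf.ols("sales ~ youtube", data=marketing).fit()
print(fit_youtube.summary())

# Multiple linear regression: sales on all three advertising budgets
fit_all = smf.ols("sales ~ youtube + facebook + newspaper", data=marketing).fit()
print(fit_all.summary())   # comparable to R's summary(lm(...)) output
```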
Model | Hypothesis, hθ(x) | Notes
---|---|---
Simple Linear Regression | hθ(x) = θᵀx = θ₀x₀ + θ₁x₁, where conventionally x₀ = 1 | Assumes linearity in the relationship between x and y, and i.i.d. residuals
Multiple Linear Regression | hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ, where conventionally x₀ = 1 | All assumptions of simple linear regression, plus little or no multicollinearity among the X's
Polynomial Regression | hθ(x) = θ₀ + θ₁x + θ₂x² + ⋯ + θₙxⁿ | Still considered linear regression since it is linear in the regression coefficients, although the fitted relationship between x and y is non-linear
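A short sketch of the polynomial-regression note above: the design matrix contains powers of x, but the model is still fit by ordinary least squares because it is linear in the coefficients θ (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=(100, 1))

# Build the columns [x, x^2]; the intercept plays the role of theta_0
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y.ravel())
print(model.intercept_, model.coef_)   # approximately theta_0, theta_1, theta_2
```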
Method | Details
---|---
To minimize the cost function (e.g., by gradient descent) | Least-squares cost J(θ) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)²; gradient descent repeats θⱼ := θⱼ − α ∂J(θ)/∂θⱼ until convergence
To solve analytically using the normal equation | θ = (XᵀX)⁻¹Xᵀy <br> Implementation: <br> - Clojure (incanter): `(mmult (mmult (solve (mmult (trans X) X)) (trans X)) y)` <br> - Python (NumPy, with `from numpy.linalg import inv`): `inv(X.T.dot(X)).dot(X.T).dot(y)` <br> - R: `solve( t(X) %*% X ) %*% t(X) %*% y`
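A runnable NumPy sketch of both methods in the table, on synthetic data (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # x0 = 1 prepended by convention
theta_true = np.array([3.0, 1.5, -2.0])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

# (1) Analytical solution via the normal equation: theta = (X'X)^{-1} X'y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# (2) Batch gradient descent on the least-squares cost J(theta)
theta_gd = np.zeros(3)
m, lr = len(y), 0.1
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y) / m   # gradient of J(theta)
    theta_gd -= lr * grad

print(theta_ne, theta_gd)   # both should be close to theta_true
```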
Coefficient | Interpretation
---|---
Unstandardized | The amount by which the dependent variable changes when the independent variable changes by one unit, holding the other independent variables constant.
Standardized | Measured in units of standard deviation (of both X and Y). A beta of 1.25 indicates that a one-standard-deviation increase in the independent variable is associated with a 1.25-standard-deviation increase in the dependent variable.
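A sketch of how the two kinds of coefficients relate, on synthetic data: z-scoring both X and y before fitting yields the standardized coefficients, which equal the unstandardized ones rescaled by sd(xⱼ)/sd(y).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=200)

X, y = df[["x1", "x2"]], df["y"]
unstd = LinearRegression().fit(X, y).coef_            # change in y per one-unit change in x_j

Xz = (X - X.mean()) / X.std(ddof=0)                   # z-score predictors
yz = (y - y.mean()) / y.std(ddof=0)                   # z-score response
std = LinearRegression().fit(Xz, yz).coef_            # change in y (in SDs) per one-SD change in x_j

print(unstd)
print(std)
print(unstd * X.std(ddof=0) / y.std(ddof=0))          # same values as `std`
```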
Model | Penalty | Description
---|---|---
Ridge regression <br> (to handle multicollinearity) <br> Named "ridge" because the λIₚ term adds a "ridge" along the diagonal of XᵀX | L2 regularization: add the squared magnitude of the coefficients (squared L2 norm) as a penalty term to the cost function, used when there is multicollinearity among the X's (which leads to overfitting of the data). <br> - R's `glmnet(alpha=0)` minimizes the elastic-net objective with the pure L2 penalty <br> - sklearn's `Ridge()` minimizes `||y - Xw||^2_2 + alpha * ||w||^2_2` <br> - statsmodels' `OLS.fit_regularized(L1_wt=0.0)` minimizes `0.5*RSS/n + 0.5*alpha*|params|^2_2` | - Multicollinearity may give rise to large variances of the coefficient estimates, which can be reduced by ridge regression. <br> - Recall the least-squares estimator θ = (XᵀX)⁻¹Xᵀy; it fails when (XᵀX)⁻¹ does not exist, i.e. XᵀX is singular. <br> - Such non-invertibility is due to (a) multicollinearity, or (b) more predictors than observations. <br> - The ridge fix is θ = (XᵀX + λI)⁻¹Xᵀy. <br> - Note, we do not penalize the intercept term.
Lasso regression <br> (to perform feature selection, or to simplify the model) | L1 regularization: add the magnitude of the coefficients (L1 norm) as a penalty term to the cost function, used when the model is too complex and has trivial predictors (which leads to overfitting of the data). <br> - R's `glmnet(alpha=1)` minimizes the elastic-net objective with the pure L1 penalty <br> - sklearn's `Lasso()` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1` <br> - statsmodels' `OLS.fit_regularized(L1_wt=1.0)` minimizes `0.5*RSS/n + alpha*|params|_1` | - Shrinks some of the coefficients to exactly 0. <br> - This leads to a sparse model (many coefficients equal to 0), helping interpretability. <br> - Keeps the more important predictors. <br> - Note, we do not penalize the intercept term. <br> - Note, the lasso objective is not differentiable at 0.
Elastic net | Lasso L1 penalty + ridge L2 penalty | ---
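A minimal sklearn sketch of the two penalties quoted in the table (synthetic data; the alpha values are illustrative, and because each library scales its objective differently the same alpha is not directly comparable across Ridge, Lasso, glmnet, or statsmodels):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)   # near-duplicate column -> multicollinearity
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)            # penalties assume comparable feature scales

ridge = Ridge(alpha=1.0).fit(Xs, y)               # L2 penalty shrinks correlated coefficients
lasso = Lasso(alpha=0.1).fit(Xs, y)               # L1 penalty drives some coefficients exactly to 0
print(ridge.coef_)
print(lasso.coef_)                                # note the exact zeros (sparse model)
# Both estimators fit an unpenalized intercept by default,
# consistent with the "do not penalize the intercept" note above.
```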
With each regularized regression, λ is the only tuning parameter. As λ increases, the coefficients tend to shrink, but the MSE tends to increase (see the graphs below); thus there is a sweet spot where the coefficients are shrunk and the MSE is also lowest. Because the implementations in R and Python are different, the coefficients across implementations may not be directly comparable; however, the MSE/RMSE/R² are comparable.
Regression | Coef vs. log(λ) | MSE vs. log(λ)
---|---|---
Ridge | (plot) | (plot)
Lasso | (plot) | (plot)
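A sketch of how the two plot columns above can be produced: sweep a log-spaced grid of alpha values (standing in for log λ), recording the fitted coefficients and the cross-validated MSE at each value (synthetic data; lasso shown, ridge is analogous):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=120)

alphas = np.logspace(-3, 1, 30)
coefs, mses = [], []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10000)
    coefs.append(model.fit(X, y).coef_)               # data for "coef vs. log lambda"
    mses.append(-cross_val_score(model, X, y,
                                 scoring="neg_mean_squared_error",
                                 cv=5).mean())        # data for "MSE vs. log lambda"

# coefs and mses would be plotted against np.log(alphas)
best = alphas[int(np.argmin(mses))]
print(best)   # the "sweet spot" where cross-validated MSE is lowest
```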
Model performance with the testing set:

 | Linear regression | Ridge regression | Lasso
---|---|---|---
RMSE | 418.3987 | 374.5406 | 380.2771
R² | 0.2209 | 0.3757 | 0.3564
Coefficients | | |
Compared to linear regression, both ridge and lasso regression appear to improve model performance on the test set (lower RMSE and higher R²).
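A sketch of how such a test-set comparison can be produced (synthetic data, so the numbers will not match the table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each model on the training set, evaluate RMSE and R2 on the held-out test set
for name, model in [("Linear", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    r2 = r2_score(y_te, pred)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```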