
easytidymodels

The goal of easytidymodels is to make running analyses in R with the tidymodels framework both easier and more reproducible. It wraps the tidymodels packages so that, once your data preprocessing is done, each model runs in one line of code and automatically tunes every hyperparameter it offers.

If you are not familiar with tidymodels, I recommend starting with the official tidymodels documentation and tutorials.

For more detail on how the functions in this package work, check the reference page, read the vignettes on this site, or call help on the function of interest in R. Below is a brief overview of the package's workflow.

Installation

You can install easytidymodels like this:

# install.packages("devtools")
devtools::install_github("amanda-park/easytidymodels")

Preparing Data for Analysis

There are three main functions to prepare your data for analysis:

  • trainTestSplit lets you split data into training and testing sets, with the ability to stratify on a variable and split based on a point in time.
  • cvFolds splits your data into cross-validation folds to allow the model’s hyperparameters to be tuned.
  • createRecipe does some basic data preprocessing on your dataset. NOTE: I recommend calling recipe() and creating a recipe object specific to your dataset’s needs, as every dataset will require its own preprocessing prior to analysis.
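
A minimal sketch of these three steps is below. The exact argument names (e.g. responseVar) and the returned element names (e.g. splits$train) are assumptions for illustration; check each function's help page (?trainTestSplit, ?cvFolds, ?createRecipe) for the actual interface.

library(easytidymodels)

# Toy data; "Species" stands in for your response variable
df <- iris
resp <- "Species"

# Split into training and testing sets (argument names assumed)
splits <- trainTestSplit(df, responseVar = resp)
train_df <- splits$train
test_df <- splits$test

# Cross-validation folds for hyperparameter tuning
folds <- cvFolds(train_df)

# Basic preprocessing recipe; for real projects, build your own
# recipe with recipes::recipe() tailored to your dataset
rec <- createRecipe(train_df, responseVar = resp)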

Classification Functions

The binary classification machine learning models available are as follows:

  • XGBoost (function xgBinaryClassif)
  • Logistic Regression (function logRegBinary)
  • K-Nearest Neighbors (function knnClassif)
  • Support Vector Machine (function svmClassif)

The multiclass classification models available are as follows:

  • XGBoost (function xgMultiClassif)
  • Multinomial Regression (function logRegMulti)
  • K-Nearest Neighbors (function knnClassif)
  • Support Vector Machine (function svmClassif)

Each of these models automatically tunes the appropriate hyperparameters for the model. You can also choose which evaluation metric the tuning optimizes. The available metrics are as follows:

  • Balanced Accuracy (Average of Sensitivity and Specificity, call “bal_accuracy”)
  • Mean Log Loss (Call “mn_log_loss”)
  • ROC AUC (Area Under the Receiver Operating Characteristic curve, call “roc_auc”)
  • MCC (Matthews Correlation Coefficient, call “mcc”)
  • Kappa (Normalized Accuracy, call “kap”)
  • Sensitivity (Call “sens”)
  • Specificity (Call “spec”)
  • Precision (Call “precision”)
  • Recall (Call “recall”)

Save the model output to an object; the function returns a list whose elements (accessed with $) include:

  • Confusion matrix on training data
  • Accuracy evaluation on training data
  • Confusion matrix on testing data
  • Accuracy evaluation on testing data
  • Description of final model chosen
  • A tuned version of the model (in case you want to try model stacking or refit the optimal model under a different evaluation metric)
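
A hedged sketch of this workflow, continuing from the data-preparation example above. The argument names and returned element names here are assumptions for illustration; see the help page for each function (e.g. ?xgBinaryClassif) for its actual interface.

# Fit and tune an XGBoost binary classifier, optimizing ROC AUC
# (argument and element names below are assumed, not confirmed)
xgClass <- xgBinaryClassif(
  recipe = rec,
  folds = folds,
  train = train_df,
  test = test_df,
  evalMetric = "roc_auc"
)

# Access the returned list elements with $
xgClass$trainConfusionMatrix  # confusion matrix on training data
xgClass$testConfusionMatrix   # confusion matrix on testing data
xgClass$finalModel            # description of the final model chosen
xgClass$tunedModel            # tuned model, e.g. for stacking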

Regression Functions

The regression functions available are as follows:

  • Random Forest (function rfRegress)
  • XGBoost (function xgRegress)
  • Linear Regression (function linearRegress)
  • MARS (function marsRegress)
  • K-Nearest Neighbor Regression (function knnRegress)
  • Support Vector Machine Regression (function svmRegress)

These models likewise let you optimize hyperparameters for a specific evaluation metric. The available metrics are as follows:

  • RMSE (Root Mean Squared Error, call “rmse”)
  • MAE (Mean Absolute Error, call “mae”)
  • RSQ (R-Squared, call “rsq”)
  • MASE (Mean Absolute Scaled Error, call “mase”)
  • CCC (Concordance Correlation Coefficient, call “ccc”)
  • IIC (Index of Ideality of Correlation, call “iic”)
  • Huber Loss (call “huber_loss”)

Save the model output to an object; the function returns a list whose elements (accessed with $) include:

  • Predictions on training data
  • RMSE and MAE evaluation on training data
  • Predictions on testing data
  • RMSE and MAE evaluation on testing data
  • Description of final model chosen
  • A tuned version of the model (in case you want to try model stacking or refit the optimal model under a different evaluation metric)
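
A similarly hedged sketch for regression, reusing the objects from the data-preparation example (here your response variable would be numeric). As before, the argument and element names are assumptions; check ?rfRegress for the actual interface.

# Fit and tune a random forest regression, optimizing RMSE
# (argument and element names below are assumed, not confirmed)
rfReg <- rfRegress(
  recipe = rec,
  folds = folds,
  train = train_df,
  test = test_df,
  evalMetric = "rmse"
)

rfReg$testPredictions  # predictions on testing data
rfReg$testEval         # RMSE and MAE evaluation on testing data
rfReg$tunedModel       # tuned model, e.g. for stacking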