machlearn

A Simple Yet Powerful Machine Learning Python Library

Install

pip install machlearn

Example 1: k-Nearest Neighbors

from machlearn import kNN
kNN.demo("iris")
kNN.demo_from_scratch("iris")

Selected Output:

This demo uses a public dataset of Fisher's Iris, which has a total of 150 samples from three species of Iris ('setosa', 'versicolor', 'virginica').
The goal is to use 'the length and the width of the sepals and petals, in centimeters', to predict which species of Iris the sample belongs to.

Using a grid search and a kNN classifier, the best hyperparameters were found as follows:
   Step1: scaler: StandardScaler(with_mean=True, with_std=True);
   Step2: classifier: kNN_classifier(n_neighbors=12, weights='uniform', p=2.00, metric='minkowski').

[Figures: the Iris dataset; the kNN confusion matrix]
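
For reference, this tuning maps onto a standard scikit-learn pipeline. A minimal sketch of the presumed setup (using sklearn's own iris loader rather than machlearn's datasets module; the grid below is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fisher's iris: 150 samples, 4 features, 3 species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),             # Step 1: standardize features
                 ("classifier", KNeighborsClassifier())])  # Step 2: kNN (metric='minkowski', p=2 is Euclidean)
param_grid = {"classifier__n_neighbors": list(range(1, 31)),
              "classifier__weights": ["uniform", "distance"]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_)
print(f"testing-set accuracy: {search.score(X_test, y_test):.3f}")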


Example 2: Naive Bayes

from machlearn import naive_bayes as nb
nb.demo_from_scratch()
nb.demo(dataset="SMS_spam")

Selected Output from nb.demo(dataset="SMS_spam"):

This demo uses a public dataset of SMS spam, which has a total of 5574 messages = 4827 ham (legitimate) and 747 spam.
The goal is to use 'term frequency in message' to predict whether the message is ham (class=0) or spam (class=1).

Using a grid search and a multinomial naive Bayes classifier, the best hyperparameters were found as follows:
   Step1: Tokenizing text: CountVectorizer(analyzer = <_lemmas>, ngram_range = (1, 1));
   Step2: Transforming from occurrences to frequency: TfidfTransformer(use_idf = True).

The top 3 terms with the highest probability of a message being spam (the classification is either spam or ham):
   "claim": 81.28%
   "prize": 80.24%
   "won": 76.29%

Application example:
   - Message: "URGENT! We are trying to contact U. Todays draw shows that you have won a 2000 prize GUARANTEED. Call 090 5809 4507 from a landline. Claim 3030. Valid 12hrs only."
   - Probability of spam (class=1): 95.85%
   - Classification: spam

[Figures: an example SMS spam text; the naive Bayes confusion matrix]

[Figures: the naive Bayes ROC curve; the naive Bayes precision-recall curve]
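
The two-step pipeline above maps directly onto scikit-learn's text-classification tools. A minimal sketch, with two made-up messages standing in for the SMS spam corpus and the lemma-based analyzer omitted for brevity:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

messages = ["URGENT! You have won a prize, call now to claim it",
            "Are we still meeting for lunch today?"]
labels = [1, 0]  # 1 = spam, 0 = ham

pipe = Pipeline([("vectorizer", CountVectorizer(ngram_range=(1, 1))),  # Step 1: token counts
                 ("tfidf", TfidfTransformer(use_idf=True)),            # Step 2: counts -> tf-idf
                 ("classifier", MultinomialNB())])
pipe.fit(messages, labels)
print(pipe.predict_proba(["You have won a GUARANTEED prize, claim now"]))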

Example 3: Decision Boundary Comparison (Classification with Two Features)

from machlearn import kNN
kNN.demo("Social_Network_Ads")

from machlearn import naive_bayes as nb
nb.demo("Social_Network_Ads")

from machlearn import SVM
SVM.demo("Social_Network_Ads")

from machlearn import decision_tree as DT
DT.demo("Social_Network_Ads", classifier_func = "DT")

from machlearn import logistic_regression as logreg
logreg.demo("Social_Network_Ads")

from machlearn import neural_network as NN
NN.demo("Social_Network_Ads")

from machlearn import ensemble
ensemble.demo("Social_Network_Ads")

[Figures: decision boundaries on the testing set for kNN, Gaussian naive Bayes, SVM, decision tree, logistic regression, MLP neural network, random forest, and gradient boosting classifiers]
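
All eight plots follow the same meshgrid recipe: predict the class at every point of a fine grid over the two features and color the regions. A minimal sketch for any fitted two-feature classifier (plot_decision_boundary is a hypothetical helper, not part of machlearn):

import matplotlib.pyplot as plt
import numpy as np

def plot_decision_boundary(clf, X, y, step=0.02):
    # clf: a fitted classifier with .predict(); X: (n_samples, 2) array; y: class labels
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)                   # decision regions
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")   # actual samples
    plt.show()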

Example 4: Imbalanced Data

from machlearn import imbalanced_data
imbalanced_data.demo()

Summary of output:

To mitigate the problems associated with class imbalance, the majority class (y=0) is downsampled to match the minority class (y=1).

These are insensitive to class imbalance:
- Area Under ROC curve
- Geometric mean
- Matthews correlation coefficient
- Recall, TPR
- Specificity, 1-FPR

These are sensitive to class imbalance:
- Area Under PR curve
- Accuracy
- F1 score
- Precision
[Figures: extremely imbalanced data (left) vs. the majority class downsampled to match the minority class (right), compared via bar charts, confusion matrices, ROC curves, and PR curves]
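
Downsampling the majority class can be done with scikit-learn's resample. A minimal sketch (downsample_majority is a hypothetical helper, not part of machlearn):

import numpy as np
from sklearn.utils import resample

def downsample_majority(X, y, majority=0, minority=1, seed=0):
    # draw, without replacement, a majority-class subset equal in size to the minority class
    X_maj, X_min = X[y == majority], X[y == minority]
    X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=seed)
    X_bal = np.vstack([X_maj_down, X_min])
    y_bal = np.array([majority] * len(X_min) + [minority] * len(X_min))
    return X_bal, y_bal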

Example 5: Regularization

from machlearn import linear_regression as linreg
linreg.demo_regularization()

Summary of output:

Issues: (a) high multicollinearity and (b) too many features; these lead to overfitting and poor generalization.
- After L2 regularization (ridge regression), variance among the coefficient estimates is reduced [more robust/stable estimates], and R-squared is higher and RMSE lower on the testing set [better generalization].
- After L1 regularization (lasso regression), coefficient estimates of relatively trivial features shrink to exactly 0 [a simpler model], and R-squared is higher and RMSE lower on the testing set [better generalization].
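
Both effects can be seen outside the demo on synthetic data with many uninformative features. A minimal sketch (the alpha values are illustrative, not the demo's settings):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# many features, few informative: a recipe for overfitting with plain OLS
X, y = make_regression(n_samples=100, n_features=80, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=1.0))]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: test R-squared = {model.score(X_test, y_test):.3f}, "
          f"test RMSE = {rmse:.3f}, zero coefficients = {np.sum(model.coef_ == 0)}")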

Example 6: Gradient Descent

from machlearn import gradient_descent as GD
GD.demo("Gender")

Summary of output:

This example uses a batch gradient descent (BGD) procedure, the cost function of logistic regression, 30,000 iterations, a learning rate of 0.00025, and Male (1, 0) as the target.
- Theta estimates of [const, Height (inch), Weight (lbs)]: [-0.00977953, -0.4779923, 0.19667817]
- Compared to estimates from statsmodels ([0.69254314, -0.49262002, 0.19834042]), the estimates associated with Height and Weight are very close
- Accuracy of prediction:  0.919

[Figures: descriptive statistics (a pair plot of the Gender dataset); batch gradient descent training loss vs. epoch; training cost as a function of theta]
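
A minimal NumPy sketch of the BGD procedure described above, using the demo's learning rate and iteration count (the function name and interface are assumptions, not machlearn's API):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_BGD(X, y, learning_rate=0.00025, n_iterations=30000):
    # X: (n_samples, n_features) with a leading column of 1s for the constant term
    # y: (n_samples,) binary targets, e.g., Male = 1, Female = 0
    theta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        # full-batch gradient of the logistic (cross-entropy) cost
        gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= learning_rate * gradient
    return theta  # [const, Height, Weight] in the demo's setup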


Example 7: Decision Tree

from machlearn import decision_tree as DT
DT.demo()
DT.demo_from_scratch(question_type="regression") # dataset='boston'
DT.demo_from_scratch(question_type="classification") # dataset='Social_Network_Ads', X=not scaled, criterion=entropy, max_depth=2

Summary of output:

- DT.demo_from_scratch(question_type="regression") uses decision_tree_regressor_from_scratch()
- DT.demo_from_scratch(question_type="classification") provides results essentially identical to the tree graph below.

[Figure: decision tree for the Social_Network_Ads dataset (X not scaled, criterion=entropy, max_depth=2)]
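
With criterion=entropy, the from-scratch classifier picks, at each node, the feature and threshold that maximize information gain. A minimal sketch of those two quantities (helper names are illustrative):

import numpy as np

def entropy(y):
    # Shannon entropy of the class labels, in bits
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    # reduction in entropy achieved by splitting y into two child nodes
    w = len(y_left) / len(y)
    return entropy(y) - (w * entropy(y_left) + (1 - w) * entropy(y_right))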


Example 8: Ensemble Methods

from machlearn import ensemble
ensemble.demo()
ensemble.demo("Social_Network_Ads")
ensemble.demo("boston")

Summary of output:

- These demos call the following functions, which were developed from scratch to reflect the inner workings of these methods:
* random_forest_classifier_from_scratch();
* adaptive_boosting_classifier_from_scratch();
* gradient_boosting_regressor_from_scratch() (see training history plot below): R_squared = 0.753, RMSE = 4.419

[Figure: gradient boosting training loss history on the boston dataset]
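
For the gradient boosting regressor, the core loop fits each new tree to the current residuals. A minimal sketch under squared-error loss (the function names, and the use of sklearn's DecisionTreeRegressor as the base learner, are assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # under squared-error loss, the negative gradient is just the residual y - prediction
    f0 = y.mean()                      # initial constant prediction
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # small step toward the residuals
        trees.append(tree)
    return f0, trees

def gradient_boosting_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)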


Example 9: Assumption Testing

from machlearn import linear_regression as linreg
linreg.demo_assumption_test()

Summary of output:

The assumptions of linear regression include (1) a linear relationship between X and y, (2) residuals that are independently and identically distributed as normal (I.I.D.), and (3) little or no multicollinearity when there are multiple independent variables (IVs).

Selected output:

[Figures: linearity test; homoscedasticity test]
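
The standard checks behind such plots are available in statsmodels and scipy; a minimal sketch on synthetic data (machlearn's assumption_test() may use different tests):

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))           # constant plus two IVs
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
model = sm.OLS(y, X).fit()

print(stats.shapiro(model.resid))                        # normality of residuals
print(het_breuschpagan(model.resid, model.model.exog))   # homoscedasticity
print(durbin_watson(model.resid))                        # independence (autocorrelation)
print([variance_inflation_factor(X, i) for i in (1, 2)]) # multicollinearity among IVs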

module: model_evaluation

function | description
plot_ROC_and_PR_curves() | plots both the ROC and the precision-recall curves, along with statistics
plot_ROC_curve() | plots the ROC (receiver operating characteristic) curve, along with statistics
plot_PR_curve() | plots the precision-recall curve, along with statistics
plot_confusion_matrix() | plots the confusion matrix, along with key statistics, and returns accuracy
demo_CV() | provides a demo of cross validation in this module
demo() | provides a demo of the major functions in this module

module: datasets

function | description
public_dataset() | returns a public dataset as specified (e.g., iris, SMS_spam, Social_Network_Ads)

module: kNN

function | description
kNN_classifier_from_scratch() | kNN classifier developed from scratch
demo_from_scratch() | provides a demo of selected functions in this module
demo() | provides a demo of selected functions in this module

module: naive_bayes

class/function | description
Multinomial_NB_classifier_from_scratch() | multinomial NB classifier developed from scratch
demo_from_scratch() | provides a demo of selected functions in this module
Gaussian_NB_classifier() | when X are continuous variables
Multinomial_NB_classifier() | when X are independent discrete variables with 3+ levels (e.g., term frequency in the document)
Bernoulli_NB_classifier() | when X are independent binary variables (e.g., whether a word occurs in a document or not)
demo() | provides a demo of selected functions in this module
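
The same rule of thumb (which NB variant for which feature type) carries over to the scikit-learn classes these wrappers presumably build on. A minimal sketch with made-up data:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 1, 0])
X_continuous = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1]])  # e.g., measurements in cm
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])         # e.g., term frequencies
X_binary = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])         # e.g., word presence/absence

GaussianNB().fit(X_continuous, y)    # continuous X
MultinomialNB().fit(X_counts, y)     # discrete counts with 3+ levels
BernoulliNB().fit(X_binary, y)       # binary X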

module: SVM

function | description
demo() | provides a demo of selected functions in this module

module: decision_tree

class/function | description
decision_tree_regressor_from_scratch() | decision tree regressor developed from scratch
decision_tree_classifier_from_scratch() | decision tree classifier developed from scratch
demo_from_scratch() | provides a demo of selected functions in this module
decision_tree_regressor() | decision tree regressor
decision_tree_classifier() | decision tree classifier
demo() | provides a demo of selected functions in this module

module: neural_network

function | description
multi_layer_perceptron_classifier() | multi-layer perceptron (MLP) classifier
rnn() | recurrent neural network
demo() | provides a demo of selected functions in this module

module: logistic_regression

function | description
logistic_regression_sklearn() | solutions using sklearn
logistic_regression_statsmodels() | solutions using statsmodels
demo() | provides a demo of selected functions in this module

module: linear_regression

function | description
assumption_test() | tests the assumptions of linear regression
lasso_regression() | lasso regression
ridge_regression() | ridge regression
linear_regression_normal_equation() | linear regression solved via the normal equation
linear_regression() | linear regression
demo() | provides a demo of selected functions in this module
demo_regularization() | provides a demo of regularization in this module
demo_assumption_test() | provides a demo of assumption testing in this module

module: DSA

function | description
demo() | provides a demo of selected functions in this module

module: stats

function | description
demo() | provides a demo of selected functions in this module

module: pipeline

class/function | description
demo() | provides a demo of selected functions in this module

module: imbalanced_data

function | description
demo() | provides a demo of selected functions in this module

module: decomposition

function | description
demo() | provides a demo of selected functions in this module

module: gradient_descent

class/function | description
logistic_regression_BGD_classifier() | logistic regression classifier trained with batch gradient descent
batch_gradient_descent() | batch gradient descent class
demo() | provides a demo of selected functions in this module

module: ensemble

class/function | description
gradient_boosting_regressor_from_scratch() | gradient boosting regressor developed from scratch
adaptive_boosting_classifier_from_scratch() | adaptive boosting classifier developed from scratch
random_forest_classifier_from_scratch() | random forest classifier developed from scratch
bagging_classifier_from_scratch() | bagging classifier developed from scratch
gradient_boosting_classifier() | gradient boosting classifier
adaptive_boosting_classifier() | adaptive boosting classifier
random_forest_classifier() | random forest classifier
bagging_classifier() | bagging classifier
voting_classifier() | voting classifier
demo() | provides a demo of selected functions in this module