
# Machine Learning

Machine learning is learning the (hidden or obvious) mapping of the world from a sample of data, via computer algorithms that optimize their parameters through training.

In supervised ML, the learned model uses the information carried by the features to reduce uncertainty when predicting the target label; in other words, it identifies data signals that remain relevant for new, unseen cases.

Machine learning is different from automation: automation executes fixed, human-specified rules, while ML infers its rules from data.
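
As a minimal sketch of what "reducing uncertainty" means (the data and the feature name `saw_promo` are hypothetical, and only the Python standard library is assumed), the snippet below computes the Shannon entropy of a binary purchase label before and after splitting on a feature; the drop in entropy is the information gain the feature provides:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction from splitting the labels by a feature's values."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical retail example: did the customer purchase (1) or not (0)?
purchased = [1, 1, 0, 0, 1, 0, 1, 0]
saw_promo = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]

print(f"uncertainty before: {entropy(purchased):.3f} bits")
print(f"information gain from 'saw_promo': {information_gain(purchased, saw_promo):.3f} bits")
```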


## Algorithms / models

| Algorithm / model | Type | Use case | Online demo / example |
|---|---|---|---|
| Association Rules | Unsupervised/Supervised | To identify items frequently bought together in transactional data; to perform market basket / affinity analysis | Demo: Generating association rules with transactions data (*interactive*) |
| Neural Network | Unsupervised/Supervised | To understand how similar products are in order to design a campaign | Example: R |
| Deep Neural Network: Softmax | Unsupervised/Supervised | To capture personalized preferences for a latent factor model for recommendations; to detect fraudulent transactions | Example: see Collaborative Filtering |
| Collaborative Filtering | Unsupervised | To recommend an item to a buyer because (a) similar buyers purchased it and (b) the buyer purchased similar item(s) | Examples: Python, R |
| Content-Based Filtering | Unsupervised | To recommend an item to a buyer because the item strongly fits the buyer's preferences | Example: Illustration |
| Clustering | Unsupervised | To understand the grouping of consumers with respect to their purchase habits | Examples: Python, R |
| PCA | Unsupervised | (a) To summarize data on a 2D map; (b) to reconstruct data using PCs | Examples: Clojure, Python, R |
| t-SNE | Unsupervised | To visualize data consisting of legitimate and fraudulent transactions on a 2D map | Examples: Python, R |
| UMAP | Unsupervised | To visualize higher-dimensional data on a 2D map | Examples: Python, R |
| Network Analysis | Unsupervised | To understand the dynamics of how purchasing one item may affect purchasing another | Examples: Python, R |
| Bayesian/Probabilistic Networks | Supervised | To predict the chain of events linked to a greater likelihood of consumer purchase | Example: R |
| k-Nearest Neighbors | Supervised | To predict which product a new customer may like, given the customer's characteristics | Examples: Python, R |
| Support Vector Machine (SVM) | Supervised | To predict a consumer's dichotomous purchasing decision | Examples: Python, R |
| Naive Bayes | Supervised | To predict a consumer's dichotomous purchasing decision | Example: Python |
| Linear Regression | Supervised | To explain sales via advertising budget | Examples: Python, R |
| Logistic Regression | Supervised | To predict a consumer's dichotomous purchasing decision | (1) Example: Python; (2) Demo: Running logistic regression with retail data (*interactive*) |
| Decision Tree | Supervised | To predict a consumer's purchasing decision | Example: Decision trees of consumer purchasing |
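
As one hedged, minimal sketch of the Collaborative Filtering row (this is not the repository's linked Python/R example; the ratings matrix is hypothetical and only numpy is assumed), the snippet below scores an unrated item for a user via a similarity-weighted average of other users' ratings:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, cols: items; 0 = unrated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, user, item):
    """Predict a rating as a similarity-weighted average of ratings
    given by other users who have rated the item."""
    others = [u for u in range(len(R)) if u != user and R[u, item] > 0]
    sims = np.array([cosine_sim(R[user], R[u]) for u in others])
    ratings = np.array([R[u, item] for u in others])
    return sims @ ratings / sims.sum()

# How much would user 0 like item 2, which they have not rated yet?
print(f"predicted rating: {predict(R, 0, 2):.2f}")
```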

## Assumptions

| Algorithm / model (selected) | Assumptions |
|---|---|
| Association Rules (e.g., Apriori) | 1. All subsets of a frequent itemset are frequent. |
| Decision Trees | 1. The data can be described by features.<br>2. The class label can be predicted by a logical sequence of decisions in a decision tree.<br>3. Effectiveness can be achieved by finding a smaller tree with lower error. |
| Neural Networks | As opposed to real neurons:<br>1. Nodes connect to each other sequentially via distinct layers.<br>2. Nodes within the same layer do not communicate with each other.<br>3. Nodes of the same layer share the same activation function.<br>4. Input nodes communicate with output nodes only indirectly, via the hidden layer(s). |
| K-means clustering | 1. The clusters are spherical.<br>2. The clusters are of similar size. |
| Naive Bayes | 1. Every pair of feature variables is conditionally independent, given the class label.<br>2. Each feature makes an equal contribution to the target variable. |
| Logistic Regression | 1. The DV is binary or ordinal.<br>2. Observations are independent of each other.<br>3. Little or no multicollinearity among the IVs.<br>4. Linearity between the IVs (the X's) and the log odds (the z).<br>5. A large sample size: at minimum, 10 cases of the least frequent DV outcome per IV.<br>6. No influential values (extreme values or outliers) in the continuous IVs. |
| Linear Regression | 1. Linearity: the relationship between X and Y is linear.<br>2. Independence: the residuals are independent of each other.<br>3. Homoscedasticity: the variance of the residuals is the same for all values of X.<br>4. Normality: the residuals are normally distributed.<br>(2-4 are also known as IID: the residuals are independently, identically distributed as normal.)<br>5. No or little multicollinearity among the X's (for multiple linear regression). |
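
The regression assumptions above can be checked empirically. Below is a minimal sketch (the advertising-budget data is hypothetical; numpy, scipy, and statsmodels are assumed to be available) that fits an OLS model and tests residual normality (Shapiro-Wilk) and homoscedasticity (Breusch-Pagan):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical advertising-budget example (see "Linear Regression" above)
budget = rng.uniform(10, 100, 200)
sales = 5.0 + 0.8 * budget + rng.normal(0, 5, 200)

X = sm.add_constant(budget)          # design matrix with an intercept column
model = sm.OLS(sales, X).fit()
resid = model.resid

# Normality of residuals (assumption 4): Shapiro-Wilk test
shapiro_stat, shapiro_p = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", shapiro_p)

# Homoscedasticity (assumption 3): Breusch-Pagan test
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_p)
```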

## Algorithm Selection

Algorithm selection depends on several factors, including (a) the nature of the data, (b) the goal of the analysis, (c) the relative performance of the algorithms, and (d) how readily the model can be integrated with business operations.

| Factor | Details |
|---|---|
| Nature of the data | Categorical, continuous, etc. |
| Goal of the analysis | To describe, estimate, predict, cluster, classify, associate, explain, etc.<br>For example, decision trees are more readily interpretable than neural networks. |
| Algorithm performance / model evaluation | For classification, predictive power can be assessed via the area under the ROC curve.<br>For regression, there are a variety of choices, including R², AIC, and RMSE. |
| Business integration | Data availability.<br>Model tuning vs. building a new model.<br>Thinking through IT integration at the beginning of the project.<br>Business end users' actual uses. |
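
To make the model-evaluation row concrete, here is a minimal sketch (synthetic data generated with scikit-learn, not tied to any of the repository's demos) that compares two classifiers by the area under the ROC curve:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a dichotomous purchasing decision
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
]:
    model.fit(X_train, y_train)
    # AUC compares ranking quality on held-out data, independent of threshold
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```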