Online Shoppers Intent

Executive Summary:

We are assuming the position of data scientists at 'DataScienceDeals.com', a company that sells courses on Data Science. We have data on the visitors who come to our site.

The objective of this project is to predict which visitors are likely to produce revenue. This will feed into our new 1:1 chat help service feature that will be able to aid potential customers.

Key Files:

Link to google slides presentation file
Online_Shoppers_Intent.pdf: PDF of presentation slides
Online_Shoppers_Intent.ipynb: Jupyter Notebook
online_shoppers_intention.csv: Raw CSV file with data

Methodology:

1. Executive Summary
2. Import Data
3. Data Cleansing
4. Data Exploration
   4.1 Dataset Information
   4.2 Feature Information
   4.3 Screening for Categorical Variables
   4.4 Pairwise Associations and Correlation of Variables
5. Baseline Model
   5.1 Split and transform the training and test data
   5.2 Decision tree model
6. Exploring Improvements on Baseline Model
   6.1 Grid Search CV for Decision tree with Entropy impurity
   6.2 Logistic Regression Classifier
   6.3 Ensemble Methods - Random Forests
7. Threshold Selection
8. Testing our Model
9. Conclusions
10. Recommendations

Key Findings:

The initial scores for the baseline decision tree model: F1: 0.5728, Accuracy: 0.8691, Roc_AUC: 0.7498.
Combinatorial hyperparameter optimisation gave only very small improvements over this baseline.

The list below gives some idea of how important each feature is to the random forest model. Only the ten most important features are included.

The top ten feature importances

PageValues, Score: 0.316
ExitRates, Score: 0.081
ProductRelated_Duration, Score: 0.08
ProductRelated, Score: 0.071
Administrative_Duration, Score: 0.052
BounceRates, Score: 0.051
Administrative, Score: 0.041
Informational_Duration, Score: 0.026
Month_Nov, Score: 0.019
Informational, Score: 0.019
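
The scores above can be ranked and summed with a few lines of Python. This is a minimal sketch that uses only the ten scores listed above; `top_features` is a hypothetical helper and is not taken from the notebook. A fitted scikit-learn random forest exposes such scores through its `feature_importances_` attribute, and it is an assumption that the notebook read them from there.

```python
# Top-ten random forest importances, copied from the list above.
importances = {
    "PageValues": 0.316,
    "ExitRates": 0.081,
    "ProductRelated_Duration": 0.08,
    "ProductRelated": 0.071,
    "Administrative_Duration": 0.052,
    "BounceRates": 0.051,
    "Administrative": 0.041,
    "Informational_Duration": 0.026,
    "Month_Nov": 0.019,
    "Informational": 0.019,
}


def top_features(scores, k):
    """Return the k (name, score) pairs with the largest scores, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


top_three = top_features(importances, 3)
top_ten_share = sum(importances.values())

for name, score in top_three:
    print(f"{name}: {score:.3f}")
print(f"Combined importance of the ten listed features: {top_ten_share:.3f}")
```

If the importances are normalised to sum to 1, as scikit-learn's are, the ten listed features together account for roughly three quarters of the total importance.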

The image below shows the optimal threshold value given the costs associated with each confusion matrix outcome:

As you can see, the optimal threshold sits at a point that favours recall over precision. This reflects the nature of our business: we want the model's predictions to capture the highest possible proportion of the customers that come to our site.
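
One way to pick a threshold from outcome costs is to scan candidate thresholds and keep the one with the lowest total cost. The sketch below illustrates the idea only: the cost values (`cost_fp`, `cost_fn`), the toy labels and the toy probabilities are all hypothetical, since the README does not state the figures used in the notebook.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count (tp, fp, fn, tn) when predicting positive at prob >= threshold."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total penalty for the mistakes made at this threshold."""
    _, fp, fn, _ = confusion_counts(y_true, y_prob, threshold)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, y_prob, thresholds, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        thresholds,
        key=lambda t: total_cost(y_true, y_prob, t, cost_fp, cost_fn),
    )


# Toy data (hypothetical): 1 = session produced revenue.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
thresholds = [i / 100 for i in range(5, 100, 5)]

# A missed buyer (false negative) is assumed to cost five times a false alarm.
recall_leaning = best_threshold(y_true, y_prob, thresholds, cost_fp=1.0, cost_fn=5.0)
# Reversing the costs pushes the threshold up, towards precision.
precision_leaning = best_threshold(y_true, y_prob, thresholds, cost_fp=5.0, cost_fn=1.0)

print(recall_leaning, precision_leaning)
```

When a false negative is the more expensive mistake, the selected threshold moves down, which is the recall-favouring behaviour described above.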

An out-of-the-box logistic regression model performed better than the baseline model, giving a roc_auc score of 0.90.
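
For readers unfamiliar with the metric, roc_auc can be read as the probability that a randomly chosen revenue session is scored higher than a randomly chosen non-revenue session. The sketch below computes it directly from that definition on made-up scores; it is an illustration, not the notebook's code, and the function name `roc_auc` is chosen here for convenience.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores (hypothetical): three of the four positive/negative pairs
# are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the 0.7498 of the baseline and the 0.90 of logistic regression in context.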

The random forest ensemble model evaluated best at the end of the project. Its initial roc_auc score of 0.89 increased to 0.933 after hyperparameter optimisation through Random and Grid Search CV (6 iterations).
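
The difference between the two search strategies can be sketched without training anything: a grid search scores every combination in the grid, while a random search scores only a sample of them. Everything below is illustrative. The grid values are hypothetical, `toy_score` is a stand-in for a cross-validated roc_auc rather than a real model evaluation, and the helper names are invented here. In scikit-learn the corresponding classes are `GridSearchCV` and `RandomizedSearchCV`.

```python
import itertools
import random


def expand_grid(param_grid):
    """Yield every combination in the grid as a dict (Cartesian product)."""
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def random_candidates(param_grid, n_iter, seed=0):
    """Draw n_iter combinations at random, as a random search would."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in sorted(param_grid.items())}
        for _ in range(n_iter)
    ]


def best_candidate(candidates, score_fn):
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical grid; the parameter names are real RandomForestClassifier options.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 5, 10],
}


def toy_score(params):
    """Stand-in for a cross-validated roc_auc; NOT a real model evaluation."""
    depth = params["max_depth"] or 32  # treat "no limit" as a large depth
    return (
        -abs(depth - 8)
        - abs(params["min_samples_leaf"] - 5)
        - abs(params["n_estimators"] - 300) / 100
    )


grid = list(expand_grid(param_grid))
best = best_candidate(grid, toy_score)
sampled = random_candidates(param_grid, n_iter=6)

print(len(grid), best)
```

A common pattern, and plausibly the one used here, is to run a cheap random search first and then a finer grid search around the best region it finds.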

Conclusions:

Our model favours recall over precision so that we capture a higher proportion of the customers who come to our site. The chosen threshold reflects the costs of false negatives and false positives, where a false negative costs more than a false positive.

There was a slight increase in model performance when using ensemble models (Random Forest) over Logistic Regression. The Grid Search parameter optimisation was paramount in obtaining the highest roc_auc score in our random forest model.

Page Value was the most important feature according to the random forest model. It represents "the average value for a page that a user visited before landing on the goal page or completing an Ecommerce transaction (or both)."
Exit rate was second most important (score 0.081): "For all pageviews to the page, Exit Rate is the percentage that were the last in the session". Product related duration was a close third (score 0.08) and represents the time spent on product related pages.

A key component of our model was incorporating the costs (penalties or rewards for getting predictions wrong or right). Once we selected the threshold (0.34) that took those costs into account, we obtained a precision score of 0.66 and a recall score of 0.72.
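
To make the effect of the threshold concrete, the sketch below computes precision and recall from predicted probabilities at 0.34 and at the conventional 0.5. The labels and probabilities are made up for illustration and will not reproduce the 0.66 and 0.72 reported above; `precision_recall_at` is a hypothetical helper name.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Return (precision, recall) when predicting positive at prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 1 = session produced revenue.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.5, 0.35, 0.2, 0.3, 0.1]

precision_low, recall_low = precision_recall_at(y_true, y_prob, threshold=0.34)
precision_default, recall_default = precision_recall_at(y_true, y_prob, threshold=0.5)

print(f"threshold 0.34: precision={precision_low:.2f}, recall={recall_low:.2f}")
print(f"threshold 0.50: precision={precision_default:.2f}, recall={recall_default:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.34 trades some precision for higher recall, which is the same direction of trade-off the chosen threshold makes.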

Recommendations:

The logistic regression model could be explored further, as it had a high out-of-the-box roc_auc score of 0.90.

Feature selection could be implemented at some stage to reduce dimensionality and give the model more relevant input data. Only 9 of the 400+ features had an importance score of over 0.1.

Based on the model performance, we can go ahead and provide the feature engineering team with the data needed to build a chat box model. The chat box will certainly benefit from the results of this model, but more work can be done to increase the model performance.

Feedback can be provided to the team on the most important features according to the model. Specifically, 'Page Values', 'Exit Rate' and 'Product Related Duration', among a few others, would be worth reverse engineering in pursuit of optimising the website and thus increasing the revenue for 'DataScienceDeals.com'.

algakovic/Online_Shoppers_Intention
