-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathK_Mean_Clustering_for_Customer_Segmentation.Rmd
279 lines (212 loc) · 13.4 KB
/
K_Mean_Clustering_for_Customer_Segmentation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
---
title: "K-Mean Clustering for Customer Segmentation"
subtitle: ""
author: "Diego Usai"
date: '25 May 2019'
output:
html_document:
theme: readable
highlight: pygments
number_sections: false
fig.align: 'centre'
toc: true
toc_float: true
toc_depth : 4
font-family: Roboto
code_folding: none
keep_md: false
dpi: 300
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
out.width = "80%",
out.height = "100%",
fig.align = "center")
```
```{r, include=FALSE}
# turn off locale-specific sorting for messages in English
Sys.setlocale("LC_TIME", "C")
```
# Overview
[__Market segmentation__](https://en.wikipedia.org/wiki/Market_segmentation) refers to the process of dividing a consumer market of existing and/or potential customers into groups (or segments) based on shared attributes, interests, and behaviours.
For this mini-project I will use the popular [__K-Means clustering__](https://en.wikipedia.org/wiki/K-means_clustering) algorithm to segment customers based on their response to a series of __marketing campaigns__. The basic concept is that consumers who share common traits would respond to __marketing communication__ in a similar way so that companies can reach out for each group in a relevant and effective way.
### What is K-Means Clustering?
__K-Means clustering__ is part of the [__Unsupervised Learning__](https://en.wikipedia.org/wiki/Unsupervised_learning) modelling family, a set of techniques used to find patterns in data that has not been labelled, classified or categorized. As this method does not require to have a target for clustering, it can be of great help in the exploratory phase of customer segmentation.
The fundamental idea is that customers assigned to a group are as similar as possible, whereas customers belonging to different groups are as dissimilar as possible. Each cluster is represented by its `centre`, corresponding to the mean of elements assigned to the cluster.
To illustrate the principle, suppose that you have a set of elements like those in the picture below and want to classify them into __3 clusters__
```{r random points, out.width="450px", echo=FALSE}
knitr::include_graphics("img/random-points.png")
```
What K-Means would do for you is to group them up around the middle or `centre` of each cluster, represented here by the __"X"'s__, in a way that minimises the distance of each element to its `centre`
```{r ramdom points clusters, out.width="450px", echo=FALSE}
knitr::include_graphics("img/random-points-clustered.png")
```
So how does that help you to __better understand your customers__? Well, in this case you can use their __behavior__ (specifically, which offer they did or did not go for) as a way to grouping them up with similar minded customers. You can now look into each of those groups to unearth trends and patterns and use them for shaping future offers.
__On a slightly more technical note__, it's important to mention that there are many K-Means algorithms available ( _Hartigan-Wong_, _Lloyd_, _MacQueen_ to name but a few) but they all share the same basic concept: each element is assigned to a cluster so that it minimises the __sum of squares Euclidean Distance__ to the `centre` - a process also referred to as minimising the __total within-cluster sum of squares__ (`tot.withinss`).
### Loading the Packages
```{r packages}
library(tidyverse)
library(lubridate)
library(knitr)
library(readxl)
library(broom)
library(umap)
library(ggrepel)
library(plotly)
```
# The Data
The [dataset](https://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch02.zip) comes from [John Foreman](http://www.john-foreman.com/blog)'s book, [Data Smart](http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html). It contains __sales promotion__ data for a __fictional wine retailer__ and includes details of _32 promotions_ (including wine variety, minimum purchase quantity, percentage discount, and country of origin) and a list of _100 customers_ and the promotions they responded to.
```{r offers}
offers_tbl <- read_excel('../00_data/WineKMC.xlsx', sheet = 'OfferInformation')
offers_tbl <- offers_tbl %>%
set_names(c('offer', 'campaign', 'varietal', 'min_qty_kg',
'disc_pct','origin', 'past_peak'))
kable(head(offers_tbl))
```
```{r transactions}
transac_tbl <- read_excel('../00_data/WineKMC.xlsx', sheet = 'Transactions')
transac_tbl <- transac_tbl %>%
set_names(c('customer', 'offer'))
kable(head(transac_tbl))
```
The data needs to be converted to a `User-Item format`(a.k.a. _Customer-Product matrix_), featuring _customers_ across the top and _offers_ down the side. The cells are populated with __0's__ and __1’s__, where __1’s__ indicate if a customer responded to a specific offer.
This type of matrix is also known as __binary rating matrix__ and _does NOT require normalisation_.
```{r user_item format}
wine_tbl <- transac_tbl %>%
left_join(offers_tbl) %>%
mutate(value = 1) %>%
spread(customer,value, fill = 0)
kable(head(wine_tbl))
```
# Clustering the Customers
The K-Means algorithm comes with the __stats__ package, one of the __Core System Libraries__ in R, and is fairly straightforward to use. I just need to pass a few parameters to the __kmeans()__ function.
```{r clustering }
user_item_tbl <- wine_tbl[,8:107]
set.seed(196) # for reproducibility of clusters visualisation with UMAP
kmeans_obj <- user_item_tbl %>%
kmeans(centers = 5, # number of clusters to divide customer list into
nstart = 100, # specify number of random sets to be chosen
iter.max = 50) # maximum number of iterations allowed -
```
I can quickly inspect the model with`glance()`from the `broom`package, which provides an summary of model-level statistics
```{r glance, collapse = T}
glance(kmeans_obj) %>% glimpse()
```
The one metric to really keep an eye on is the __total within-cluster sum of squares__ (or `tot.withinss`) as the optimal number of clusters is that which minimises the `tot.withinss`. So I want to fit the k-means model for different number of clusters and see where `tot.withinss` reaches its minimum.
First, I build a function for a set number of `centers` (4 in this case) and check that is working on `glance()`.
```{r}
kmeans_map <- function(centers = 4) {
user_item_tbl %>%
kmeans(centers = centers, nstart = 100, iter.max = 50)
}
```
```{r, collapse = T}
4 %>% kmeans_map() %>% glance()
```
Then, I create a __nested tibble__, which is a way of "nesting" columns inside a data frame. The great thing about nested data frames is that you can put essentially anything you want in them: lists, models, data frames, plots, etc!
```{r}
kmeans_map_tbl <- tibble(centers = 1:15) %>% # create column with centres
mutate(k_means = centers %>%
map(kmeans_map)) %>% # iterate `kmeans_map` row-wise to gather
# kmeans models for each centre in column 2
mutate(glance = k_means %>%
map(glance)) # apply `glance()` row-wise to gather each
# model’s summary metrics in column 3
kmeans_map_tbl %>% glimpse()
```
Last, I can build a `scree plot` and look for the "elbow" on the graph, the point where the number of additional clusters seem to level off. In this case __5__ appears to be an optimal number as the drop in `tot.withinss` for 6 is not as pronounced as the previous one.
```{r}
kmeans_map_tbl %>%
unnest(glance) %>% # unnest the glance column
select(centers, tot.withinss) %>% # select centers and tot.withinss
ggplot(aes(x = centers, y = tot.withinss)) +
geom_line(colour = 'grey30', size = .8) +
geom_point(colour = 'green4', size = 3) +
geom_label_repel(aes(label = centers),
colour = 'grey30') +
theme_bw() +
labs(title = 'Scree Plot')
```
# Visualising the Segments
Now that I have identified the optimal number of clusters I want to visualise them. To do so, I use the [__UMAP__](https://umap-learn.readthedocs.io/en/latest/) ( _Uniform Manifold Approximation and Projection_ ), a dimentionality reduction technique that can be used for cluster visualisation in a similar way to [__Principal Component Analysis__](https://en.wikipedia.org/wiki/Principal_component_analysis) and [__t-SNE__](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).
First, I create a umap object and pull out the `layout` argument (containing coordinates that can be used to visualize the dataset), change its format to a tibble and attach the `offer` column from the `wine_tbl`.
```{r}
umap_obj <- user_item_tbl %>% umap()
umap_tbl <- umap_obj$layout %>%
as_tibble() %>% # change to a tibble
set_names(c('x', 'y')) %>% # remane columns
bind_cols(wine_tbl %>% select(offer)) # attach offer reference
```
Then, I `pluck` the 5th kmeans model from the __nested tibble__, attach the _cluster_ argument from the `kmeans` function to the output, and join offer and cluster to the umap_tbl.
```{r}
umap_kmeans_5_tbl <- kmeans_map_tbl %>%
pull(k_means) %>%
pluck(5) %>% # pluck element 5
broom::augment(wine_tbl) %>% # attach .cluster to the tibble
select(offer, .cluster) %>%
left_join(umap_tbl, by = 'offer') # join umap_tbl to clusters by offer
```
At last, I can visualise the UMAP'ed projections of the clusters. `plotly` adds some nice interactivity, which brings the chart to life!
```{r}
g <- umap_kmeans_5_tbl %>%
mutate(label_text = str_glue('Offer: {offer}
Cluster: {.cluster}')) %>%
ggplot(aes(x,y, colour = .cluster)) +
geom_point(aes(text = label_text)) +
theme_light() +
labs(title = 'UMAP 2D Projections of K-Means Clusters',
caption = "") +
theme(legend.position = 'none')
ggplotly(g, tooltip = "label_text")
```
# Evaluating the Clusters
Now we can finally have a closer look at the single clusters to see what K-Means has identified.
But let's first bring all information together in one data frame.
```{r}
cluster_trends_tbl <- wine_tbl %>%
left_join(umap_kmeans_5_tbl) %>%
arrange(.cluster) %>%
select(.cluster, offer:past_peak)
```
### Cluster 1 & 2
Customers in __cluster 1__ purchase __high volumes of sparkling wines__ (Champagne and Prosecco) whilts those in the __second segment__ favour __low volume purchases__ of different varieties.
```{r}
cluster_trends_tbl %>%
filter(.cluster ==1 | .cluster ==2) %>%
count(.cluster, varietal, origin, min_qty_kg, disc_pct) %>%
select(-n) %>%
kable()
```
### Cluster 3 & 4
Customers in these groups have very specific taste when it comes to wine: those in the __third segment__ have a penchant for __Pinot Noir__, whereas __group 4__ only buys __French Champagne in high volumes__.
```{r}
cluster_trends_tbl %>%
filter(.cluster ==3 | .cluster ==4 ) %>%
group_by() %>%
count(.cluster, varietal, origin, min_qty_kg, disc_pct) %>%
select(-n) %>%
kable()
```
### Cluster 5
The __fifth segment__ is a little more difficult to categorise as it encompasses many different attributes. The only clear trend is that customers in this segment picked up all available __Cabernet Sauvignon__ offers.
```{r}
cluster_trends_tbl %>%
filter(.cluster ==5 ) %>%
count(.cluster, varietal, origin, min_qty_kg, disc_pct) %>%
select(-n) %>%
kable()
```
# Final thoughts
Although it's not going to give you all the answers, clustering is a powerful exploratory exercise that can help you reveal patterns in your consumer base, especially when you have a brand new market to explore and do not have any prior knowledge of it.
It's very easy to implement and even on a small dataset like the one I used here, you can unearth interesting patterns of behaviour in your customer base.
With little effort we have leaned that some of our customers _favour certain varieties of wine_ whereas others prefer to _buy high or low quantities_. Such information can be used to tailor your __pricing strategies__ and __marketing campaings__ towards those customers that are more inclined to respond. Moreover, customer segmentation allows for a more efficient allocation of marketing resources and the maximization of cross- and up-selling opportunities.
Segmentation can also be enriched by overlaying information such as a customers' __demographics__ (age, race, religion, gender, family size, ethnicity, income, education level), __geography__ (where they live and work), and __psychographic__ (social class, lifestyle and personality characteristics) but these go beyond the scope of this mini-project.
### Code Repository
The full R code can be found on [my GitHub profile](https://github.com/DiegoUsaiUK/Customer_Segmentation)
### References
* For a broader view on the potential of [Customer Segmentation](https://medium.com/@kristenkehrer/how-to-use-customer-segmentation-to-learn-f49e82b9a959)
* For a critique of some of [k-means drawback](https://www.datascience.com/blog/k-means-alternatives)
* [Customer Segmentation at Bain & Company](https://www.bain.com/insights/management-tools-customer-segmentation)