---
title: Bellabeat Data Analysis Capstone Project
author: Ruiz del Carmen
portfolio: https://www.notion.so/ruizdelcarmen/Ruiz-del-Carmen-Data-Portfolio-e725748d0e0546c386be6c6c7dc49099
linkedin: https://www.linkedin.com/in/ruizdelcarmen/
github: https://github.com/r-uiz
hire: yes
output:
  html_document:
    keep_md: true
---
# Bellabeat Data Analysis Capstone Project
## Table of Contents
1. [Summary](#1-summary)
    1. [Background](#11-background)
    2. [This Project](#12-this-project)
2. [Ask Phase](#2-ask-phase)
    1. [Business Task Statement](#21-business-task-statement)
3. [Prepare Phase](#3-prepare-phase)
    1. [Data Source](#31-data-source)
    2. [Accessibility and Privacy of Data](#32-accessibility-and-privacy-of-data)
    3. [Information About Our Dataset](#33-information-about-our-dataset)
    4. [Data Organization](#34-data-organization)
    5. [Data Integrity and Credibility](#35-data-integrity-and-credibility)
4. [Process Phase](#4-process-phase)
    1. [Installing Packages and Opening Libraries](#41-installing-packages-and-opening-libraries)
    2. [Loading the Data](#42-loading-the-data)
    3. [Preview the Data](#43-preview-the-data)
    4. [Check the Data Structure](#44-check-the-data-structure)
    5. [Data Cleaning](#45-data-cleaning)
        1. [Check Number of Participants](#451-check-number-of-participants)
        2. [Check for Duplicates](#452-check-for-duplicates)
        3. [Remove Duplicates & Missing Values](#453-remove-duplicates--missing-values)
        4. [Rename Columns](#454-rename-columns)
        5. [Convert Date Columns](#455-convert-date-columns)
    6. [Merge Data Sets](#46-merge-data-sets)
5. [Analyze & Share Phase](#5-analyze--share-phase)
    1. [Daily Activity](#51-daily-activity)
        1. [Insights](#insights)
    2. [Daily Sleep](#52-daily-sleep)
        1. [Insights](#insights-1)
    3. [Daily Steps v. Calories Burned](#53-daily-steps-v-calories-burned)
        1. [Insights](#insights-2)
    4. [Hourly Intensity](#54-hourly-intensity)
        1. [Insights](#insights-3)
    5. [Hourly Steps](#55-hourly-steps)
        1. [Insights](#insights-4)
    6. [Steps by Weekday](#56-steps-by-weekday)
        1. [Insights](#insights-5)
6. [Recommendations](#6-recommendations)
7. [References](#7-references)
## 1. Summary
### 1.1 Background
> This is a capstone project for the Google Data Analytics Professional Certificate. The scenario below is the one given by the course.
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
### 1.2 This Project
This study analyzes smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. The insights will be applied to growth opportunities for two Bellabeat products: primarily the **Time** smart watch and, secondarily, the **Membership**.
- **Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
- **Bellabeat membership**: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
## 2. Ask Phase
### 2.1 Business task statement
Gain insight from public data on how consumers use wearable health-tracking technology, and use it to inform Bellabeat's marketing strategy, specifically for the **Time** smart watch and, secondarily, the **Membership** guidance.
**Stakeholders**
- **Urška Sršen:** Bellabeat's cofounder and Chief Creative Officer
- **Sando Mur:** Mathematician and Bellabeat's cofounder
- **Bellabeat marketing analytics team**
## 3. Prepare Phase
### 3.1 Data Source
The data source used for this case study is the [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit), a dataset hosted on Kaggle and made available by [Möbius](https://www.kaggle.com/arashnic).
### 3.2 Accessibility and privacy of data:
The data source is verified to be available for public use and is in the public domain under the [CC0 1.0 Deed](https://creativecommons.org/publicdomain/zero/1.0/). The data source's author has waived all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. Anyone can copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.
### 3.3 Information about our dataset:
- [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)
- This dataset was generated by 30 Fitbit users who responded to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
- Column information is available in the [Fitabase data dictionary](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf).
### 3.4 Data Organization:
Eighteen CSV files are available for analysis, each containing different quantitative data tracked by Fitbit. The data is organized in a long format, where each row represents a single time point per subject, resulting in multiple rows for each user. Each user has a unique ID, and the data is tracked by day and time.
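To make the long format concrete, here is a minimal sketch with made-up values (the IDs and step counts are illustrative, not taken from the dataset). Each row is one user on one day, so a user appears on several rows.

```r
# Hypothetical long-format data: one row per user per day (values are made up)
long_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(8000, 10500, 7200, 3000, 4100)
)

# Number of rows (days) recorded for each user
rows_per_user <- table(long_demo$Id)
print(rows_per_user)
```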
### 3.5 Data Integrity and Credibility:
The dataset has limitations, including a small sample size (30 users) and a lack of demographic information. The missing gender data is a particular concern because Bellabeat targets women, so the sample may suffer from sampling bias and may not be representative of the general population or of Bellabeat's customers. Additionally, the dataset is not current (it dates from 2016), and the survey period was limited to two months. Therefore, the case study will adopt an operational approach.
## 4. Process Phase
For this analysis, I will primarily use R because of its ease of use, its ability to handle the amount of data to be processed, its straightforward documentation, and its data visualization tools for sharing results with stakeholders.
### 4.1 Installing packages and opening libraries
Let's start by loading the necessary libraries that will aid our analysis.
```{r echo = T, results = 'hide', error=FALSE, message=FALSE, warning=FALSE}
library(tidyverse) # For data manipulation
library(skimr) # For data summary
library(janitor) # For cleaning column names
library(lubridate) # For date manipulation
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
```
### 4.2 Loading the data
The data is stored in 18 CSV files. We will load the six files relevant to this analysis, each into a separate data frame. Later, we will merge them into one daily and one hourly data frame for analysis.
```{r echo = T, results = 'hide', error=FALSE, message=FALSE, warning=FALSE}
# Load the data
daily_activity <- read_csv("data/dailyActivity_merged.csv") %>%
  as.data.frame()
daily_sleep <- read_csv("data/sleepDay_merged.csv") %>%
  as.data.frame()
hourly_intensities <- read_csv("data/hourlyIntensities_merged.csv") %>%
  as.data.frame()
hourly_calories <- read_csv("data/hourlyCalories_merged.csv") %>%
  as.data.frame()
hourly_steps <- read_csv("data/hourlySteps_merged.csv") %>%
  as.data.frame()
weight <- read_csv("data/weightLogInfo_merged.csv") %>%
  as.data.frame()
```
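As an alternative to one `read_csv()` call per file, the files could be read in a loop into a named list. The sketch below is only an illustration under stated assumptions: it uses base R's `read.csv()` rather than `readr`, and it writes two tiny stand-in files to a temporary folder instead of touching the real `data/` directory. The names `demo_dir`, `demo_files`, and `demo_data` are hypothetical.

```r
# Write two tiny CSV files to a temporary folder (stand-ins for the Fitbit files)
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(8000, 3000)),
          file.path(demo_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalMinutesAsleep = c(400, 450)),
          file.path(demo_dir, "sleepDay_merged.csv"), row.names = FALSE)

# Named vector: data frame name -> file name
demo_files <- c(daily_activity = "dailyActivity_merged.csv",
                daily_sleep = "sleepDay_merged.csv")

# Read every file into one list of data frames, then label the list entries
demo_data <- lapply(file.path(demo_dir, demo_files), read.csv)
names(demo_data) <- names(demo_files)

str(demo_data, max.level = 1)
```

Keeping the data frames in a named list makes it easy to apply the same check to all of them with `lapply()` or `sapply()`.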
### 4.3 Preview the data
Let's take a look at the first few rows of each data frame to understand the structure of the data.
```{r echo = T, results = 'hide'}
head(daily_activity)
head(daily_sleep)
head(hourly_intensities)
head(hourly_calories)
head(hourly_steps)
head(weight)
```
### 4.4 Check the data structure
Let's check the structure of each data frame to understand the variables and data types.
```{r echo = T, results = 'hide'}
str(daily_activity)
str(daily_sleep)
str(hourly_intensities)
str(hourly_calories)
str(hourly_steps)
str(weight)
```
### 4.5 Data Cleaning
We will clean the data by addressing missing values, renaming columns, and converting data types to facilitate analysis.
#### 4.5.1 Check number of participants
Let's check the number of participants in the dataset to ensure that the sample size is consistent across all data frames.
```{r}
# Check the number of participants in each data frame
length(unique(daily_activity$Id))
length(unique(daily_sleep$Id))
length(unique(hourly_intensities$Id))
length(unique(hourly_calories$Id))
length(unique(hourly_steps$Id))
length(unique(weight$Id))
```
The weight data has too few participants compared to the other data frames, so we will exclude it from the analysis to avoid bias from such a small sample. All other data frames have 33 participants, except for `daily_sleep`, which has 24.
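The same participant count can be produced for several data frames in one call. This is a minimal sketch on made-up IDs; `demo_frames` and `participants` are hypothetical names, not objects from the analysis.

```r
# Toy stand-ins for the loaded data frames (IDs are made up)
demo_frames <- list(
  activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep    = data.frame(Id = c(1, 2, 2)),
  weight   = data.frame(Id = c(3))
)

# Distinct participants per data frame, returned as one named vector
participants <- sapply(demo_frames, function(df) length(unique(df$Id)))
print(participants)
```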
#### 4.5.2 Check for Duplicates
Let's check for duplicates in each data frame to ensure data integrity.
```{r echo = T, results = 'hide'}
# Check for duplicates in each data frame
sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))
sum(duplicated(hourly_intensities))
sum(duplicated(hourly_calories))
sum(duplicated(hourly_steps))
```
#### 4.5.3 Remove Duplicates & Missing Values
Let's remove duplicates and address missing values in each data frame.
```{r}
# Remove duplicates and missing values
daily_activity <- daily_activity %>% distinct() %>% drop_na()
daily_sleep <- daily_sleep %>% distinct() %>% drop_na()
hourly_intensities <- hourly_intensities %>% distinct() %>% drop_na()
hourly_calories <- hourly_calories %>% distinct() %>% drop_na()
hourly_steps <- hourly_steps %>% distinct() %>% drop_na()
```
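To see how many rows each cleaning step removes, the row counts can be compared before and after. The sketch below uses made-up data and base R equivalents (`unique()` for `distinct()`, `complete.cases()` for `drop_na()`); all object names are hypothetical.

```r
# Toy data with one exact duplicate row and one row containing a missing value
dup_demo <- data.frame(
  Id = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(400, 400, NA, 380)
)

n_before <- nrow(dup_demo)
n_duplicates <- sum(duplicated(dup_demo))        # exact duplicate rows
n_incomplete <- sum(!complete.cases(dup_demo))   # rows with at least one NA

# Base R equivalent of distinct() followed by drop_na()
dup_clean <- unique(dup_demo)
dup_clean <- dup_clean[complete.cases(dup_clean), ]

cat("rows before:", n_before, "| duplicates:", n_duplicates,
    "| incomplete:", n_incomplete, "| rows after:", nrow(dup_clean), "\n")
```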
#### 4.5.4 Rename Columns
Let's standardize the column names in each data frame to ensure consistency and ease of analysis.
```{r echo = T, results = 'hide'}
# Lowercase all column names (e.g. TotalSteps becomes totalsteps).
# Later chunks refer to these lowercase names, so the unassigned
# clean_names() calls, which had no effect, are omitted.
daily_activity <- rename_with(daily_activity, tolower)
daily_sleep <- rename_with(daily_sleep, tolower)
hourly_intensities <- rename_with(hourly_intensities, tolower)
hourly_calories <- rename_with(hourly_calories, tolower)
hourly_steps <- rename_with(hourly_steps, tolower)
```
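The later chunks refer to columns such as `totalsteps` and `activitydate`, which are the lowercased original names without underscores. A minimal base R sketch of that renaming on a made-up data frame (`names_demo` is a hypothetical name):

```r
# Toy data frame with Fitbit-style CamelCase column names
names_demo <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 8000)

# Lowercase the names, as rename_with(..., tolower) does
names(names_demo) <- tolower(names(names_demo))
print(names(names_demo))
```

By contrast, `janitor::clean_names()` would produce snake_case names such as `total_steps`, so the two approaches are not interchangeable once downstream code depends on one of them.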
#### 4.5.5 Convert Date Columns
Let's convert the date columns to the appropriate date format for analysis.
```{r}
daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = mdy(date))
# sleepday carries a timestamp; keep only the date so that it has the same
# type as the date column in daily_activity when the two tables are merged
daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(mdy_hms(date)))
hourly_intensities <- hourly_intensities %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))
hourly_calories <- hourly_calories %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))
hourly_steps <- hourly_steps %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))
```
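For readers without `lubridate`, the same conversions can be sketched in base R. The example strings below are illustrative values in a month/day/year layout, and the `%p` (AM/PM) specifier assumes an English locale.

```r
# Example strings in a month/day/year layout (illustrative values)
day_string  <- "4/12/2016"
time_string <- "4/12/2016 1:00:00 PM"

# Date only: base R counterpart of mdy()
day_parsed <- as.Date(day_string, format = "%m/%d/%Y")

# Date-time on a 12-hour clock: base R counterpart of mdy_hms()
# (%I = hour 1-12, %p = AM/PM marker, which assumes an English locale)
time_parsed <- as.POSIXct(time_string, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(day_parsed)
print(format(time_parsed, "%Y-%m-%d %H:%M:%S"))
```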
### 4.6 Merge Data Sets
Let's merge the daily data sets into a single data frame for simplicity during analysis.
```{r}
daily_data <- merge(daily_activity, daily_sleep, by = c("id", "date"))
```
Now let's merge the hourly data sets into a single data frame as well.
```{r}
hourly_data <- merge(hourly_intensities, hourly_calories, by = c("id", "date_time")) %>%
  merge(hourly_steps, by = c("id", "date_time"))
```
```{r}
# See column structures
str(daily_data)
str(hourly_data)
# Preview the merged data sets
head(daily_data)
head(hourly_data)
```
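Note that `merge()` performs an inner join by default, so the merged daily data only contains user-days that appear in both tables. Because `daily_sleep` covers fewer participants than `daily_activity`, the merged frame will have fewer rows than the activity data. The sketch below illustrates the difference between an inner and a left join on made-up values (all object names are hypothetical).

```r
# Toy daily tables: user 2 has activity on 2016-04-13 but no sleep record that day
act_demo <- data.frame(
  id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  totalsteps = c(8000, 10500, 3000, 4100)
)
sleep_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalminutesasleep = c(400, 430, 450)
)

# Default merge(): inner join, unmatched rows are dropped
inner_demo <- merge(act_demo, sleep_demo, by = c("id", "date"))

# all.x = TRUE: left join, every activity row is kept and sleep is NA if missing
left_demo <- merge(act_demo, sleep_demo, by = c("id", "date"), all.x = TRUE)

cat("activity rows:", nrow(act_demo), "| inner:", nrow(inner_demo),
    "| left:", nrow(left_demo), "\n")
```

Comparing `nrow()` before and after a merge is a quick way to confirm how many records were dropped.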
## 5. Analyze & Share Phase
Let's conduct exploratory data analysis to gain insights into the data and identify trends that could inform Bellabeat's marketing strategy.
### 5.1 Daily Activity
Let's start by analyzing daily activity data to understand user behavior.
```{r}
# Summary statistics for daily activity data
daily_activity %>%
  select(totalsteps, calories, sedentaryminutes, lightlyactiveminutes, fairlyactiveminutes, veryactiveminutes) %>%
  skim()
```
#### Insights:
- **Total Steps**: Users take an average of **7638** steps per day. Walking _10,000_ steps daily is associated with several health benefits, including improved cardiovascular health, weight management, better mood, and enhanced joint health. Regular walking can lower the risk of heart disease, diabetes, and high blood pressure, while also helping to reduce stress and improve overall mental well-being [[1]](https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/steps/faq-20430164)[[2]](https://www.heart.org/en/news/2021/11/16/is-10000-steps-really-a-magic-number-for-health)[[3]](https://health.clevelandclinic.org/do-you-really-need-10000-steps-a-day/). On average, users therefore fall short of the 10,000-step target.
- **Activity Levels**: While some participants meet recommended physical activity levels, many do not. There is a significant variation in physical activity levels among participants, with some being highly active and others largely sedentary. This indicates that there is an opportunity to encourage more users to engage in physical activity.
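An average below the target does not show how often the target is actually met. One complementary measure is the share of days at or above a step target. The sketch below shows the calculation on made-up step counts; `steps_demo` and `step_target` are hypothetical names, and the numbers are not results from the dataset.

```r
# Made-up daily step counts (not from the dataset)
steps_demo <- c(4200, 7600, 10400, 12000, 6900, 8100, 3100, 9900)
step_target <- 10000

mean_steps <- mean(steps_demo)
# mean() of a logical vector gives the proportion of TRUE values
share_meeting_target <- mean(steps_demo >= step_target)

cat("mean steps:", mean_steps, "| share of days at or above target:",
    share_meeting_target, "\n")
```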
### 5.2 Daily Sleep
Next, let's analyze daily sleep data to understand user sleep patterns.
```{r}
# Summary statistics for daily sleep data
daily_sleep %>%
  select(totalminutesasleep, totaltimeinbed) %>%
  skim()
```
Let's create a visualization grouped by weekday.
```{r}
# Create a new column for the weekday
daily_sleep <- daily_sleep %>%
  mutate(weekday = wday(date, label = TRUE))
# Plot total minutes asleep by weekday
daily_sleep %>%
  ggplot(aes(x = weekday, y = totalminutesasleep, fill = weekday)) +
  geom_boxplot() +
  labs(title = "Total Minutes Asleep by Weekday",
       x = "Weekday",
       y = "Total Minutes Asleep") +
  theme_minimal()
```
Let's also summarize the data by weekday.
```{r}
# Summary of total minutes asleep by weekday
daily_sleep %>%
  group_by(weekday) %>%
  summarize(avg_total_minutes_asleep = mean(totalminutesasleep))
```
#### Insights:
- **Total Minutes Asleep**: The average total minutes asleep is **419.8** (about 7.0 hours), which is marginally below the 420-minute lower bound of the recommended 7-9 hours of sleep per night for adults. Sleep is essential for overall health and well-being, with insufficient sleep linked to various health issues, including obesity, heart disease, and mental health problems [[4]](https://www.cdc.gov/sleep/about/?CDC_AAref_Val=https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html)[[5]](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need).
- **Weekday vs. Weekend Sleep**: Users tend to sleep longer on weekends than on weekdays. Sleep time on weekdays is mostly below the 7-hour minimum. This suggests that users may be catching up on sleep during the weekend because they are not getting enough sleep during the week.
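The minutes-to-hours conversion and a weekday versus weekend comparison can be sketched as follows. Only the 419.8-minute average comes from the analysis above; the small `sleep_log` table is made up for illustration, and its column names simply mirror the ones used in this report.

```r
# The reported average, converted from minutes to hours
avg_minutes_asleep <- 419.8
avg_hours_asleep <- avg_minutes_asleep / 60
print(round(avg_hours_asleep, 2))

# Toy sleep log (made-up values) to compare weekdays with weekends
sleep_log <- data.frame(
  date = as.Date(c("2016-04-11", "2016-04-12", "2016-04-15",
                   "2016-04-16", "2016-04-17")),
  totalminutesasleep = c(390, 410, 400, 470, 450)
)

# %u gives the ISO weekday number as text: "1" = Monday ... "7" = Sunday
sleep_log$is_weekend <- format(sleep_log$date, "%u") %in% c("6", "7")

# Mean minutes asleep for weekdays (FALSE) and weekends (TRUE)
weekend_means <- tapply(sleep_log$totalminutesasleep, sleep_log$is_weekend, mean)
print(weekend_means)
```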
### 5.3 Daily Steps v. Calories Burned
Let's analyze the relationship between daily steps and calories burned to understand the impact of physical activity on energy expenditure.
```{r}
# Summary statistics for daily steps and calories
daily_activity %>%
  select(totalsteps, calories) %>%
  skim()
```
Let's create a visualization to check correlation between steps and calories burned.
```{r}
# Create a scatter plot of steps vs. calories
ggplot(data = daily_activity, aes(x = totalsteps, y = calories)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Total Steps vs. Calories") +
  theme_minimal()
```
#### Insights:
- **Steps vs. Calories**: There is a positive correlation between the number of steps taken and the number of calories burned. This suggests that users who take more steps tend to burn more calories, which is essential for weight management and overall health. Encouraging users to increase their daily step count could help improve their overall health and well-being.
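The scatter plot shows the direction of the relationship, and a correlation coefficient can put a number on its strength. The following is a sketch on made-up values, not a result from the dataset; `cor_demo` and the derived objects are hypothetical names.

```r
# Made-up daily totals (not from the dataset)
cor_demo <- data.frame(
  totalsteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  calories   = c(1700, 1850, 2100, 2150, 2400, 2600)
)

# Pearson correlation coefficient: +1 is a perfect positive linear relationship
steps_cal_cor <- cor(cor_demo$totalsteps, cor_demo$calories)

# Linear fit: the slope estimates the extra calories per additional step
steps_cal_fit <- lm(calories ~ totalsteps, data = cor_demo)

print(round(steps_cal_cor, 3))
print(coef(steps_cal_fit))
```

A high coefficient describes association only. Calories burned also depend on factors such as body weight and activity intensity, which step counts do not capture.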
### 5.4 Hourly Intensity
Let's now take a look at data on hourly intensity to understand activity patterns. We first need to split date and time values.
```{r warning = FALSE}
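# Derive date and hour directly from the timestamp. format() keeps midnight
# as "00:00:00", whereas splitting the printed value on a space can lose it.
hourly_intensities <- hourly_intensities %>%
  mutate(date = as_date(date_time),
         hour = format(date_time, "%H:%M:%S")) %>%
  select(-date_time)
head(hourly_intensities)
# Average intensity for each hour of the day
hourly_intensities <- hourly_intensities %>%
  drop_na() %>%
  group_by(hour) %>%
  summarise(avg_total_int = mean(totalintensity))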
```
Let's make a visualization off this data.
```{r warning=FALSE, error=FALSE}
ggplot(data = hourly_intensities, aes(x = hour, y = avg_total_int)) +
  geom_col(fill = '#350352') +
  labs(title = "Average Total Intensity vs Hour") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
#### Insights:
- **Hourly Intensity**: The average total intensity varies throughout the day, with peaks in the morning and evening. This suggests that users are more active during these times, which could be due to work schedules, exercise routines, or other factors. Understanding these patterns can help Bellabeat tailor their marketing strategies to target users during peak activity times.
- **Peak Activity Times**: Users are most active in the morning and evening, which are common times for exercise and physical activity. The evening peak, around 5:00 pm to 7:00 pm, coincides with the time when people usually get off work. This information can be used to target users with marketing messages promoting physical activity during these peak times.
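Peak hours can also be read off programmatically instead of from the chart. The sketch below averages intensity per hour on made-up records and sorts the result; it uses base R's `aggregate()` rather than `dplyr`, and all names and values are hypothetical.

```r
# Toy hourly records (made-up values): two days, four hours each
hourly_demo <- data.frame(
  date_time = as.POSIXct(c("2016-04-12 08:00:00", "2016-04-12 12:00:00",
                           "2016-04-12 18:00:00", "2016-04-12 23:00:00",
                           "2016-04-13 08:00:00", "2016-04-13 12:00:00",
                           "2016-04-13 18:00:00", "2016-04-13 23:00:00"),
                         tz = "UTC"),
  totalintensity = c(10, 14, 30, 2, 12, 16, 26, 4)
)

# Extract the hour of day as a number (0-23)
hourly_demo$hour <- as.integer(format(hourly_demo$date_time, "%H"))

# Average intensity per hour, sorted from most to least active
avg_by_hour <- aggregate(totalintensity ~ hour, data = hourly_demo, FUN = mean)
avg_by_hour <- avg_by_hour[order(-avg_by_hour$totalintensity), ]
print(avg_by_hour)

# The first row after sorting is the most active hour
peak_hour <- avg_by_hour$hour[1]
```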
### 5.5 Hourly Steps
Let's analyze hourly steps data to understand user step patterns throughout the day. We first need to split date and time values.
```{r warning=FALSE, error=FALSE}
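# Derive date and hour directly from the timestamp. format() keeps midnight
# as "00:00:00", whereas splitting the printed value on a space can lose it.
hourly_steps <- hourly_steps %>%
  mutate(date = as_date(date_time),
         hour = format(date_time, "%H:%M:%S")) %>%
  select(-date_time)
head(hourly_steps)
# Average step count for each hour of the day
hourly_steps <- hourly_steps %>%
  drop_na() %>%
  group_by(hour) %>%
  summarise(avg_total_steps = mean(steptotal))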
```
Let's make a visualization off this data.
```{r warning=FALSE, error=FALSE}
ggplot(data = hourly_steps, aes(x = hour, y = avg_total_steps, fill = avg_total_steps)) +
  geom_col() +
  labs(title = "Average Total Steps vs Hour") +
  theme_minimal() +
  scale_fill_gradient(low = "red", high = "green") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
#### Insights:
- **Hourly Steps**: This data shows the same pattern as hourly intensity, with peaks in the morning and evening. Users take more steps during these times, so marketing messages that promote step-count goals could also be timed to these peaks.
### 5.6 Steps by Weekday
Let's analyze the average number of steps taken by users on each weekday to understand weekly activity patterns.
```{r}
# Create a new column for the weekday
daily_activity <- daily_activity %>%
  mutate(weekday = wday(date, label = TRUE))
```
Let's create a visualization to show the distribution of daily steps on each weekday, with horizontal lines at 7.5k and 10k steps.
```{r}
# Boxplot of daily steps by weekday, with reference lines at
# 7.5k steps (red) and 10k steps (green)
daily_activity %>%
  ggplot(aes(x = weekday, y = totalsteps, fill = weekday)) +
  geom_boxplot() +
  geom_hline(yintercept = 7500, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 10000, linetype = "dashed", color = "green") +
  labs(title = "Daily Steps by Weekday",
       x = "Weekday",
       y = "Total Steps") +
  theme_minimal()
```
#### Insights:
- **Steps by Weekday**: Users tend to take more steps on weekends compared to weekdays. This suggests that users may be more active on weekends, which could be due to having more free time to engage in physical activities. Bellabeat could leverage this information to encourage users to maintain their activity levels during the week.
- **Average Steps**: Although 10k steps is the most commonly cited daily target, research suggests that about 7.5k steps per day is also beneficial for health [[6]](https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2734709)[[7]](https://www.nih.gov/news-events/nih-research-matters/how-many-steps-better-health)[[8]](https://www.health.harvard.edu/blog/10000-steps-a-day-or-fewer-2019071117305). The data shows that users are mostly just below the 7.5k mark, indicating that many are not reaching even this lower threshold.
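When reading the boxplot, keep in mind that its centre line is the median, not the mean. A few very active days can pull the mean well above the median. The sketch below shows the difference on made-up step counts (`steps_by_day` is a hypothetical name).

```r
# Made-up daily step counts for two weekdays (not from the dataset)
steps_by_day <- data.frame(
  weekday = c("Mon", "Mon", "Mon", "Sat", "Sat", "Sat"),
  totalsteps = c(5000, 7000, 15000, 8000, 9000, 10000)
)

# Mean and median per weekday; a boxplot's centre line shows the median
mean_by_day <- tapply(steps_by_day$totalsteps, steps_by_day$weekday, mean)
median_by_day <- tapply(steps_by_day$totalsteps, steps_by_day$weekday, median)

print(rbind(mean = mean_by_day, median = median_by_day))
```

In this toy example both days have the same mean, but the Monday median is lower because a single 15,000-step day inflates the Monday average.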
## 6. Recommendations
**Bellabeat's** mission is to empower women's health through technology and data. Based on the data analysis, here are key marketing strategy recommendations:
1. **Monthly Events**: Organize monthly challenges or events to encourage users to increase their daily step count and physical activity levels. **Offer rewards or incentives to motivate participation when they use Bellabeat products**.
2. **Target Peak Activity Times**: Use notifications to engage users during **peak times** (morning and evening) to encourage physical activity. Weekends are also a good time to promote wellness activities since users tend to be more active during this time.
3. **Goal Setting**: Encourage users to set daily step goals and track progress to **motivate them to stay active**.
### Specifically For Bellabeat's **Time** Smart Watch:
1. **Improve Activity Tracking**: Provide real-time feedback and **encourage daily activity**. For example, the watch could vibrate when users have been inactive for too long, or send a notification when they reach their daily step goal to celebrate the achievement.
2. **Enhance Sleep Monitoring**: Offer insights and recommendations to improve sleep quality. **Provide bedtime reminders** to help users establish a healthy sleep routine.
3. **Introduce Stress Management**: Provide tools to help manage stress and promote relaxation. Offer **guided breathing exercises** or mindfulness activities to reduce stress levels.
### Specifically For Bellabeat's **App**:
1. **Personalized Guidance**: Offer tailored advice on wellness, as well as data visualization to **help users understand their health and wellness trends**.
2. **Resources and Tips**: Provide articles, videos, and resources on physical activity, sleep, nutrition, and mental health to educate and motivate users. **This could also become another revenue stream through partnerships with health and wellness brands.**
3. **Community Support**: Create a user community for shared experiences and motivation. **Encourage users to share their progress, challenges, and successes** and provide a platform for peer support.
## 7. References
1. [Mayo Clinic - Walking: Trim your waistline, improve your health](https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/steps/faq-20430164)
2. [American Heart Association - Is 10,000 steps really a magic number for health?](https://www.heart.org/en/news/2021/11/16/is-10000-steps-really-a-magic-number-for-health)
3. [Cleveland Clinic - Do You Really Need 10,000 Steps a Day?](https://health.clevelandclinic.org/do-you-really-need-10000-steps-a-day/)
4. [CDC - How Much Sleep Do I Need?](https://www.cdc.gov/sleep/about/?CDC_AAref_Val=https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html)
5. [National Sleep Foundation - How Much Sleep Do We Really Need?](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need)
6. [JAMA Network - Association of Step Volume and Intensity With All-Cause Mortality in Older Women](https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2734709)
7. [NIH Research Matters - How Many Steps Are Better for Health?](https://www.nih.gov/news-events/nih-research-matters/how-many-steps-better-health)
8. [Harvard Health - 10,000 steps a day — or fewer?](https://www.health.harvard.edu/blog/10000-steps-a-day-or-fewer-2019071117305)
---
[Back to Top](#bellabeat-data-analysis-capstone-project)