---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r, echo=TRUE}
# The working directory is assumed to be the repository root (where this file
# and activity.zip live); knitr evaluates chunks there by default, so no
# machine-specific setwd() call is needed.
# load required packages
library(dplyr)
library(lubridate)
library(sqldf)
library(lattice)
################################################################################
# download and extract datasets
# download raw data (no need, already in forked repo)
# fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
# download.file(fileurl, "repdata_data_activity")
# dateDownloaded <- date()
# unzip downloaded data
unzip("activity.zip")
# list.files("./")
################################################################################
# read in and format dataset
data <- read.csv("activity.csv", stringsAsFactors = FALSE)
# tbl_df() is deprecated in dplyr; as_tibble() is its replacement
data <- as_tibble(data)
data <- mutate(data, date = ymd(date))
```
## What is mean total number of steps taken per day?
```{r, echo=TRUE}
# 1. Calculate the total number of steps taken per day
data <- group_by(data, date)
total_steps_day <- summarise(data, total_steps_day = sum(steps, na.rm = TRUE))
# 2. Make a histogram of the total number of steps taken each day
hist(total_steps_day$total_steps_day, main = "Total Number of Steps Taken Per Day",
     xlab = "Steps")
# 3. Calculate and report the mean and median of the total number of steps taken per day
mean_total_steps_day <- summarise(total_steps_day, mean(total_steps_day, na.rm = TRUE))
median_total_steps_day <- summarise(total_steps_day, median(total_steps_day, na.rm = TRUE))
```
The **MEAN** total number of steps taken per day is approximately **`r round(mean_total_steps_day, digits = 2)`**.

The **MEDIAN** total number of steps taken per day is **`r median_total_steps_day`**.

## What is the average daily activity pattern?
```{r, echo=TRUE}
# 1. Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis)
# and the average number of steps taken, averaged across all days (y-axis)
data <- group_by(data, interval)
mean_steps_interval <- summarise(data, mean_steps_interval = mean(steps, na.rm = TRUE))
plot(mean_steps_interval$interval, mean_steps_interval$mean_steps_interval, type = "l",
     main = "Average Number of Steps Taken Per Interval", xlab = "Interval",
     ylab = "Steps")
# 2. Which 5-minute interval, on average across all the days in the dataset,
# contains the maximum number of steps?
max_mean_steps_interval <- sqldf("select interval, max(mean_steps_interval) as
max_mean_steps_interval from mean_steps_interval")
```
The 5-minute interval with the **MAXIMUM AVERAGE** number of steps, averaged across all days, is **`r max_mean_steps_interval$interval`** with approximately **`r round(max_mean_steps_interval$max_mean_steps_interval, digits = 2)`** steps.

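
The SQL lookup above depends on SQLite returning the `interval` from the same row as the `max()` value. As a cross-check, the same lookup can be sketched in base R with `which.max()`. This is only an illustration: `toy_means` is a made-up stand-in for `mean_steps_interval`, not the real activity data.

```r
# Hypothetical stand-in for mean_steps_interval (values are made up).
toy_means <- data.frame(
  interval = c(0, 5, 10, 15),
  mean_steps_interval = c(1.5, 0.25, 7.75, 3)
)

# which.max() returns the index of the first maximum, so the whole row
# (the interval and its mean) can be selected in one step.
toy_max <- toy_means[which.max(toy_means$mean_steps_interval), ]
print(toy_max)
```

Selecting the whole row keeps the interval and its mean together, so the two reported values cannot come from different rows.
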
## Imputing missing values
```{r, echo=TRUE}
# 1. Calculate and report the total number of missing values in the dataset
# (i.e. the total number of rows with NAs)
data_NA <- filter(data, is.na(steps))
```
**`r nrow(data_NA)`** records in the dataset are missing a value for `steps`.
```{r, echo=TRUE}
# 2. Devise a strategy for filling in all of the missing values in the dataset.
# substitute mean value of matching 5 minute interval for NA values
data_NA_join_mean_int <- left_join(data_NA, mean_steps_interval, by = "interval")
data_NA_filled <- mutate(data_NA_join_mean_int, steps = mean_steps_interval)
# keep only the original columns, in the original order
data_NA_filled <- select(ungroup(data_NA_filled), steps, date, interval)
rm(data_NA)
rm(data_NA_join_mean_int)
# 3. Create a new dataset that is equal to the original dataset but with the missing data filled in.
data_complete <- filter(data, !is.na(steps))
data_filled <- rbind(data_complete, data_NA_filled)
rm(data_complete)
rm(data_NA_filled)
# 4. Make a histogram of the total number of steps taken each day and Calculate
# and report the mean and median total number of steps taken per day.
# Do these values differ from the estimates from the first part of the assignment?
# What is the impact of imputing missing data on the estimates of the total daily number of steps?
# histogram of the total number of steps taken each day
data_filled <- group_by(data_filled, date)
total_steps_day_filled <- summarise(data_filled, total_steps_day = sum(steps))
hist(total_steps_day_filled$total_steps_day,
     main = "Total Number of Steps Taken Each Day\n(with missing values replaced by interval means)",
     xlab = "Steps")
# mean and median total number of steps taken per day
mean_total_steps_day_filled <- summarise(total_steps_day_filled,
                                         mean_total_steps_day = mean(total_steps_day))
median_total_steps_day_filled <- summarise(total_steps_day_filled,
                                           median_total_steps_day = median(total_steps_day))
```
The **MEAN** total number of steps taken per day (with imputed data) is approximately
**`r format(round(mean_total_steps_day_filled, digits = 2), scientific = FALSE)`**.

The **MEDIAN** total number of steps taken per day (with imputed data) is approximately
**`r format(round(median_total_steps_day_filled, digits = 2), scientific = FALSE)`**.

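Imputing missing data replaced a number of zero daily totals, which arise when `sum(steps, na.rm = TRUE)` is applied to a day whose values are all missing. This shifted the distribution of total steps per day to the right and made it look closer to a normal distribution. The mean and median have both increased and are now almost identical.
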
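
The effect on the daily totals can be reproduced on a tiny example. The sketch below uses a made-up two-day data frame (`toy`, with hypothetical values) and base R only. It shows that a day with no recorded values sums to 0 before imputation and to the sum of the interval means afterwards.

```r
# Hypothetical two-day, two-interval dataset (values are made up).
# The second day has no recorded values at all.
toy <- data.frame(
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5),
  steps    = c(10, 30, NA, NA)
)

# Without imputation, the all-NA day sums to 0 when na.rm = TRUE.
totals_raw <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Mean of each interval over the days that have a value for it.
interval_means <- ave(toy$steps, toy$interval,
                      FUN = function(x) mean(x, na.rm = TRUE))

# Replace each NA with the mean of its interval.
toy$steps_filled <- ifelse(is.na(toy$steps), interval_means, toy$steps)

# After imputation, the all-NA day gets the sum of the interval means.
totals_filled <- tapply(toy$steps_filled, toy$date, sum)

print(totals_raw)
print(totals_filled)
```

Every fully missing day receives the same imputed total, so such days move from the zero bin to the centre of the histogram.
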
## Are there differences in activity patterns between weekdays and weekends?
```{r, echo=TRUE}
# 1. Create a new factor variable in the dataset with two levels - "weekday"
# and "weekend" indicating whether a given date is a weekday or weekend day.
# Note: weekdays() returns day names in the current locale, so the comparison
# below assumes an English locale.
data_filled <- ungroup(data_filled)
data_filled <- mutate(data_filled,
                      week = ifelse(weekdays(date) %in% c("Saturday", "Sunday"),
                                    "weekend", "weekday"))
data_filled <- mutate(data_filled, week = as.factor(week))
# 2. Make a panel plot containing a time series plot (i.e. type = "l") of the
# 5-minute interval (x-axis) and the average number of steps taken, averaged
# across all weekday days or weekend days (y-axis).
data_filled <- group_by(data_filled, week, interval)
mean_steps_interval_filled <- summarise(data_filled, mean_steps_interval = mean(steps))
## Plot with lattice
xyplot(mean_steps_interval ~ interval | week, mean_steps_interval_filled, type = "l",
       main = "Average Number of Steps Taken Per Interval By Day of Week\n(with missing values replaced by interval means)",
       xlab = "Interval", ylab = "Number of Steps", layout = c(1, 2))
```
Yes, there appear to be differences in the activity patterns. On weekdays, activity peaks in the morning; on weekends, it is spread more evenly throughout the day.
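
The weekday/weekend split above compares the output of `weekdays()` with English day names, which only match in an English locale. A minimal, locale-independent sketch in base R is shown below. It is an alternative, not the method used in this report, and `toy_dates`, `day_number` and `toy_week` are hypothetical names.

```r
# Hypothetical sample dates: 2012-10-05 is a Friday, so the next two
# days are a Saturday and a Sunday.
toy_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday counts days since Sunday (0 = Sunday, 6 = Saturday)
# and does not depend on the language of the current locale.
day_number <- as.POSIXlt(toy_dates)$wday

# Label Saturdays and Sundays as "weekend", everything else as "weekday".
toy_week <- factor(ifelse(day_number %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(data.frame(date = toy_dates, week = toy_week))
```

Because the comparison is made on day numbers rather than day names, the classification is the same on any machine.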