-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_03_data_wrangling.Rmd
347 lines (228 loc) · 11.5 KB
/
02_03_data_wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "Session Three"
subtitle: "Transforming and wrangling data 1/2"
author: "Akos Mate"
date: '2018 July'
output:
html_document:
toc: true
toc_depth: 3
theme: readable
css: style.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
collapse = TRUE,
comment = "#>",
message = FALSE
)
```
> Main packages used: `base R`, `tidyr`, `ggplot2`, `dplyr`
> Main functions covered: `tidyr::gather`, `tidyr::spread`, `dplyr::group_by`, `dplyr::summarise`, `dplyr::bind_rows`, `dplyr::bind_cols` ,`runif`, `[`, `[[`
> Supplementary resources: [Data Wrangling cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
# 1. Tidyverse overview
From this session we will mostly start using packages from the `tidyverse` ecosystem. These include:
* `readr` for reading text data (.csv and .tsv)
* `tidyr` for reshaping your data
* `dplyr` for wrangling data (filtering your data, subsetting, transforming and recoding variables, etc.)
* `purrr` for functional programming
* `ggplot2` for data visualization
* `R markdown` for creating reports straight from R (.pdf, .html or .doc)
```{r, out.width = "550px", echo=FALSE}
knitr::include_graphics("tidyverse1.png")
```
The "base R vs. tidyverse" debate has been going on in the R community for a while now, but it provides a quick and consisent way to start data analysis, strong community support and robust adoption rate in both inside and outside of academia. If you want some background thoughts on this, you can read a blogpost of David Robinson, chief data scientist at DataCamp on why teach tidyverse first: [http://varianceexplained.org/r/teach-tidyverse/](http://varianceexplained.org/r/teach-tidyverse/).
We will still go over base R to provide you some alternatives if you need.
## 1.1 The pipe
The pipe is this operator: ` %>% ` You can access with the shortcut of `Ctrl+Shift+M` (or type it out every time, but there are better things in life than that).
What it does, is it passes object on left hand side as first argument (or . argument) of function on righthand side.
As an example:
```{r, eval=FALSE}
head(mydata)
# OR
mydata %>% head()
```
Piping together various steps of our data manipulation process greatly increases code readability and our quality of life. In a moment we will see how piping can be super useful.
Rejoice!
# 2. Tidy data
Let's load our packages.
```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
# if you need to install a package, use the install.packages("package") function!
```
To exploit R's vectorized nature, we should use "tidy data" whenever possible. This means that each row is an observation and each column is a variable. This is best understood with this figure from the [R for Data Science](http://r4ds.had.co.nz/tidy-data.html):
```{r, out.width = "650px", echo=FALSE}
knitr::include_graphics("tidy-data.png")
```
This format is often referred as the long data format. An example of the wide data format is where columns are values, not variables. The tidy version transforms this to create a year and inflation variables. To illustrate it with a made up dataset:
```{r echo=FALSE}
set.seed(2018)
wide_df <- data.frame(
country = c("Hungary", "Belgium", "Mongolia"),
`2006` = round(runif(3, min = 1, max = 3),1),
`2007` = round(runif(3, min = 1, max = 4),1),
`2008` = round(runif(3, min = 1, max = 5),1),
`2009` = round(runif(3, min = 1, max = 2),1),
check.names=FALSE
)
```
```{r echo=FALSE}
long_df <- wide_df %>%
gather(key = "year", value = inflation, "2006":"2009")
```
The wide data format
```{r echo=FALSE, collapse=FALSE}
# wide data format
wide_df
```
And the transformed long data format.
```{r echo=FALSE, collapse=FALSE}
# long data format
long_df
```
Illustrating of what is happening from the [R for Data Science](http://r4ds.had.co.nz/tidy-data.html):
```{r, out.width = "650px", echo=FALSE}
knitr::include_graphics("tidy_gather.png")
```
## 2.1 Transforming datasets
We will use the `tidyr` package to shape our data into the desired format. A key concept is the key-value pairs. For the wide dataset above, the values are the inflation numbers and the keys are the column headers (years). So the `1.2` (2007 inflation in Hungary) and `2007` is such a key-value pair.
We will use this concept to our transformations. The two main functions we'll use from `tidyr` are the `gather()` to create tidy data from wide data format.
Let's create our dummy data frames
```{r}
# we create a dummy dataset. The `runif()` function samples from a uniform distribution within the specified range.
wide_df <- data.frame(
country = c("Hungary", "Belgium", "Mongolia"),
`2006` = round(runif(3, min = 1, max = 3),1), # R does not allow numbers to be colnames, so use ` `
`2007` = round(runif(3, min = 1, max = 4),1),
`2008` = round(runif(3, min = 1, max = 5),1),
`2009` = round(runif(3, min = 1, max = 2),1),
check.names=FALSE # we tell R that it shouldn't worry our blatant use of numbers as colnames
)
wide_df
```
Let's use the `gather` function and let's meet the ` %>% ` pipe operator (shortcut: Ctrl + Shift + M) again! Here, we specify the range of columns that contain our key-value pairs, and also the `key` (the year in this case) and the `value` (inflation).
```{r}
long_df <- wide_df %>%
gather("2006":"2009", key = "year", value = "inflation")
long_df
```
Perfect! What if we want to go back to our wide format? Let's use the `spread()` function!
```{r}
long_df %>% spread(key = "year", value = "inflation")
```
Excellent!
> **Excercise:**
(1) load a data set on financial data of Microsoft and Tesla from this link: "https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data.csv".
(2) Try to use the readr::read_csv() and load directly from the url instead of downloading the file. The dataset contains 3 columns: a date and closing prices for Tesla and Microsoft shares.
(3) Use the `gather()` function to create a new data frame, with 3 variables: date, company and stock_closing. You should see something similar than below.
Solution:
```{r}
stock_df <- read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data.csv")
head(stock_df, 5)
stock_tidy <- stock_df %>%
gather("TSLA.Close", "MSFT.Close", key = "company", value = "stock_closing")
glimpse(stock_tidy)
head(stock_tidy)
unique(stock_tidy$company)
```
> (4) As a refresher, use the `group_by()` and `summarise()` functions to see the minimum, maximum and mean closing prices for the two company. Use the ` %>% ` pipe operator! You should get something similar as below.
solution:
```{r }
stock_tidy %>%
group_by(company) %>%
summarise(min = min(stock_closing), max = max(stock_closing), mean = mean(stock_closing))
```
> (5) Finally, use `ggplot2` to visualise the closing prices of the two company. Use the `geom_line()`! You should get a similar plot as below.
solution:
```{r }
p_stock <- ggplot(stock_tidy,
aes(date, stock_closing,
color = company))
p_stock +
geom_line() +
labs(x = "", y = "Prices (USD)",
title = "Closing daily prices for MSFT and TSLA",
subtitle = "Data from 2016-01-10 to 2018-01-10",
caption = "source: Yahoo Finance")
```
## 2.2 Data merging and splitting
While working with data, you rarely have the luxury to have all the data you need in one place. Let's recreate our previous data frame (this is one way of merging two vectors as well).
```{r}
country <- c("Thailand", "Norway", "Colombia", "Canada", "Slovenia", "France")
pop <- c(68.7, 5.2, 47.8, 35.8, 2, 63.6)
df1 <- data.frame(country, pop)
```
It just so happens that we now have data on their inflation and GDP growth.
```{r}
v1 <- runif(n = 6, min = 1, max = 3) # we use the runif() function to sample from a uniform distribution
v2 <- runif(n = 6, min = -3, max = 3)
```
We have multiple options. First, assuming the order of the values in our new valuables matches the original one, we can simply add the two column. Just be mindful that here we match by *position*. For this we use the `bind_cols` from the `dplyr` package.
```{r}
df2 <- df1 %>%
bind_cols(inflation = v1, gdp_pct = v2)
df2
```
If we have more observations (rows) we can slap them onto our dataframe as well.
```{r}
df3 <- data.frame(
country = c("Iceland", "Suriname"),
pop = c(1.2, 0.6),
inflation = c(0.1, 8.9),
gdp_pct = c(0.5, 4.2))
df3
```
We should use the `bind_rows` function from `dplyr`.
```{r}
df2 %>% bind_rows(df3)
```
> Quick excercise:
(1) load additional data from "https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data2.csv" and perform the same reshaping with the `gather()` function as before.
(2) then use `bind_rows()` to add the two new company to the previous `stock_tidy` data frame, by creating a new data frame, called `stock_tidy2`.
(3) Use the %>% to pipe together your steps. You can use the `unique()` function to check how many unique value you have in a given column.
Solution:
```{r }
stock_tidy2 <- read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data2.csv")
stock_tidy2 <- stock_tidy2 %>%
gather("AAPL.Close", "F.Close", key = "company", value = "stock_closing") %>%
bind_rows(stock_tidy)
unique(stock_tidy2$company)
```
```{r collapse=FALSE}
head(stock_tidy2, 5)
unique(stock_tidy2$company)
```
The corresponding `base R` commands are `cbind()` and `rbind()`.
# 3. quick detour: `base R` indexing and subsetting
(This section was adopted from Martin Mölder's last year course)
We have covered some parts of it already, but (if time permits) this section will give a short rundown of how to split datasets with `base R` indexing. While I think the `dplyr` approach that we will be using after this is more straightforward it is useful to at least know that there are multiple ways to access/subset/split datasets. Many online help that you'll see will either use the `dplyr` or the `base` approach and it is helpful if you can reverse engineer it.
To access a row or column in the data frame, remember that indexing in R works as `[row, column]`.
```{r}
# let's see the third row of our dataset
df2[3,]
# of course you can specify a vector of rows as well
df2[c(1:3),]
# or give individual columns if they don't happen to follow each other
df2[c(1,3),]
```
For splitting databases based on some conditions, let's remember our logical operators from before. Let's select all of the countries which are above 30 million in population. First, let's check how many values meet this criteria.
```{r}
df2$pop > 30
```
This will give as a vector of logical values that has the same length as we have rows in the data set. And we can use that to select rows – the rows with `TRUE` are selected and the rows with `FALSE` are not. We just have to put the vector of logical values where the row index of the data object is.
```{r}
df2[df2$pop > 30, ]
```
We can apply this conditional logic to selecting columns as well. We use the `|` OR operator.
```{r collapse=FALSE}
names(df2) == "pop"
names(df2) == "gdp_pct"
vars <- names(df2) == "pop" | names(df2) == "gdp_pct"
vars
df2[,vars]
# or just use the names of your variables directly
df2[,c("pop", "gdp_pct")]
```