02_03_data_wrangling.Rmd

---
title: "Session Three"
subtitle: "Transforming and wrangling data 1/2"
author: "Akos Mate"
date: '2018 July'
output:
    html_document:
        toc: true
        toc_depth: 3
        theme: readable
        css: style.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      collapse = TRUE,
                      comment = "#>",
                      message = FALSE
)
```


> Main packages used: `base R`, `tidyr`, `ggplot2`, `dplyr`     
> Main functions covered: `tidyr::gather`, `tidyr::spread`, `dplyr::group_by`, `dplyr::summarise`, `dplyr::bind_rows`, `dplyr::bind_cols` ,`runif`, `[`, `[[`    
> Supplementary resources: [Data Wrangling cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

# 1. Tidyverse overview

From this session we will mostly start using packages from the `tidyverse` ecosystem. These include:

* `readr` for reading text data (.csv and .tsv)
* `tidyr` for reshaping your data
* `dplyr` for wrangling data (filtering your data, subsetting, transforming and recoding variables, etc.)
* `purrr` for functional programming
* `ggplot2` for data visualization
* `R markdown` for creating reports straight from R (.pdf, .html or .doc)

```{r, out.width = "550px", echo=FALSE}
knitr::include_graphics("tidyverse1.png")
```

The "base R vs. tidyverse" debate has been going on in the R community for a while now, but it provides a quick and consisent way to start data analysis, strong community support and robust adoption rate in both inside and outside of academia. If you want some background thoughts on this, you can read a blogpost of David Robinson, chief data scientist at DataCamp on why teach tidyverse first: [http://varianceexplained.org/r/teach-tidyverse/](http://varianceexplained.org/r/teach-tidyverse/).

We will still go over base R to provide you some alternatives if you need.

## 1.1 The pipe

The pipe is this operator: ` %>% ` You can access with the shortcut of `Ctrl+Shift+M` (or type it out every time, but there are better things in life than that).  

What it does, is it passes object on left hand side as first argument (or . argument) of function on righthand side. 

As an example:
```{r, eval=FALSE}
head(mydata)

# OR

mydata %>% head()
```

Piping together various steps of our data manipulation process greatly increases code readability and our quality of life. In a moment we will see how piping can be super useful.

Rejoice!

# 2. Tidy data

Let's load our packages.

```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)

# if you need to install a package, use the install.packages("package") function!
```


To exploit R's vectorized nature, we should use "tidy data" whenever possible. This means that each row is an observation and each column is a variable. This is best understood with this figure from the [R for Data Science](http://r4ds.had.co.nz/tidy-data.html):


```{r, out.width = "650px", echo=FALSE}
knitr::include_graphics("tidy-data.png")
```

This format is often referred as the long data format. An example of the wide data format is where columns are values, not variables. The tidy version transforms this to create a year and inflation variables. To illustrate it with a made up dataset:

```{r echo=FALSE}
set.seed(2018)

wide_df <- data.frame(
    country = c("Hungary", "Belgium", "Mongolia"),
    `2006` = round(runif(3, min = 1, max = 3),1),
    `2007` = round(runif(3, min = 1, max = 4),1),
    `2008` = round(runif(3, min = 1, max = 5),1),
    `2009` = round(runif(3, min = 1, max = 2),1),
    check.names=FALSE
)

```



```{r echo=FALSE}
long_df <- wide_df %>% 
gather(key = "year", value = inflation, "2006":"2009")
```

The wide data format

```{r echo=FALSE, collapse=FALSE}
# wide data format
wide_df

```
And the transformed long data format.
```{r echo=FALSE, collapse=FALSE}
# long data format
long_df
```

Illustrating of what is happening from the [R for Data Science](http://r4ds.had.co.nz/tidy-data.html):
```{r, out.width = "650px", echo=FALSE}
knitr::include_graphics("tidy_gather.png")
```

## 2.1 Transforming datasets

We will use the `tidyr` package to shape our data into the desired format. A key concept is the key-value pairs. For the wide dataset above, the values are the inflation numbers and the keys are the column headers (years). So the `1.2` (2007 inflation in Hungary) and `2007` is such a key-value pair.

We will use this concept to our transformations. The two main functions we'll use from `tidyr` are the `gather()` to create tidy data from wide data format. 

Let's create our dummy data frames

```{r}

# we create a dummy dataset. The `runif()` function samples from a uniform distribution within the specified range.
wide_df <- data.frame(
    country = c("Hungary", "Belgium", "Mongolia"),
    `2006` = round(runif(3, min = 1, max = 3),1), # R does not allow numbers to be colnames, so use ` `
    `2007` = round(runif(3, min = 1, max = 4),1),
    `2008` = round(runif(3, min = 1, max = 5),1),
    `2009` = round(runif(3, min = 1, max = 2),1),
    check.names=FALSE # we tell R that it shouldn't worry our blatant use of numbers as colnames
)

wide_df

```

Let's use the `gather` function and let's meet the ` %>% ` pipe operator (shortcut: Ctrl + Shift + M) again! Here, we specify the range of columns that contain our key-value pairs, and also the `key` (the year in this case) and the `value` (inflation).

```{r}
long_df <- wide_df %>% 
gather("2006":"2009", key = "year", value = "inflation")


long_df
```

Perfect! What if we want to go back to our wide format? Let's use the `spread()` function!

```{r}
long_df %>% spread(key = "year", value = "inflation")
    
```

Excellent!

> **Excercise:**  
(1) load a data set on financial data of Microsoft and Tesla from this link: "https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data.csv".  
(2) Try to use the readr::read_csv() and load directly from the url instead of downloading the file. The dataset contains 3 columns: a date and closing prices for Tesla and Microsoft shares.  
(3) Use the `gather()` function to create a new data frame, with 3 variables: date, company and stock_closing. You should see something similar than below.

Solution:
```{r}
stock_df <- read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data.csv")



head(stock_df, 5)



stock_tidy <- stock_df %>% 
    gather("TSLA.Close", "MSFT.Close", key = "company", value = "stock_closing")



glimpse(stock_tidy)

head(stock_tidy)

unique(stock_tidy$company)
```

> (4) As a refresher, use the `group_by()` and `summarise()` functions to see the minimum, maximum and mean closing prices for the two company. Use the ` %>% ` pipe operator! You should get something similar as below.

solution:
```{r }
stock_tidy %>% 
    group_by(company) %>% 
    summarise(min = min(stock_closing), max = max(stock_closing), mean = mean(stock_closing))
```

> (5) Finally, use `ggplot2` to visualise the closing prices of the two company. Use the `geom_line()`! You should get a similar plot as below.

solution:
```{r }
p_stock <- ggplot(stock_tidy,
                  aes(date, stock_closing,
                      color = company))

p_stock +
    geom_line() +
    labs(x = "", y = "Prices (USD)",
         title = "Closing daily prices for MSFT and TSLA",
         subtitle = "Data from 2016-01-10 to 2018-01-10",
         caption = "source: Yahoo Finance")
    
```



## 2.2 Data merging and splitting

While working with data, you rarely have the luxury to have all the data you need in one place. Let's recreate our previous data frame (this is one way of merging two vectors as well).

```{r}
country <- c("Thailand", "Norway", "Colombia", "Canada", "Slovenia", "France")
pop <- c(68.7, 5.2, 47.8, 35.8, 2, 63.6)

df1 <- data.frame(country, pop)
```

It just so happens that we now have data on their inflation and GDP growth. 

```{r}
v1 <- runif(n = 6, min = 1, max = 3) # we use the runif() function to sample from a uniform distribution
v2 <- runif(n = 6, min = -3, max = 3)

```

We have multiple options. First, assuming the order of the values in our new valuables matches the original one, we can simply add the two column. Just be mindful that here we match by *position*. For this we use the `bind_cols` from the `dplyr` package.

```{r}
df2 <- df1 %>% 
    bind_cols(inflation = v1, gdp_pct = v2)

df2
```

If we have more observations (rows) we can slap them onto our dataframe as well.

```{r}
df3 <- data.frame(
    country = c("Iceland", "Suriname"),
    pop = c(1.2, 0.6),
    inflation = c(0.1, 8.9),
    gdp_pct = c(0.5, 4.2))

df3

```

We should use the `bind_rows` function from `dplyr`.

```{r}
df2 %>% bind_rows(df3)
```


> Quick excercise:  
(1) load additional data from "https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data2.csv" and perform the same reshaping with the `gather()` function as before.  
(2) then use `bind_rows()` to add the two new company to the previous `stock_tidy` data frame, by creating a new data frame, called `stock_tidy2`.  
(3) Use the %>% to pipe together your steps. You can use the `unique()` function to check how many unique value you have in a given column.

Solution:
```{r }
stock_tidy2 <- read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/stock_data2.csv")

stock_tidy2 <- stock_tidy2 %>% 
    gather("AAPL.Close", "F.Close", key = "company", value = "stock_closing") %>% 
    bind_rows(stock_tidy)

unique(stock_tidy2$company)
```

```{r collapse=FALSE}
head(stock_tidy2, 5)

unique(stock_tidy2$company)
```

The corresponding `base R` commands are `cbind()` and `rbind()`.

# 3. quick detour: `base R` indexing and subsetting
(This section was adopted from Martin Mölder's last year course)

We have covered some parts of it already, but (if time permits) this section will give a short rundown of how to split datasets with `base R` indexing. While I think the `dplyr` approach that we will be using after this is more straightforward it is useful to at least know that there are multiple ways to access/subset/split datasets. Many online help that you'll see will either use the `dplyr` or the `base` approach and it is helpful if you can reverse engineer it.

To access a row or column in the data frame, remember that indexing in R works as `[row, column]`.

```{r}
# let's see the third row of our dataset
df2[3,]

# of course you can specify a vector of rows as well
df2[c(1:3),]

# or give individual columns if they don't happen to follow each other
df2[c(1,3),]

```




For splitting databases based on some conditions, let's remember our logical operators from before. Let's select all of the countries which are above 30 million in population. First, let's check how many values meet this criteria.

```{r}
df2$pop > 30 
```

This will give as a vector of logical values that has the same length as we have rows in the data set. And we can use that to select rows – the rows with `TRUE` are selected and the rows with `FALSE` are not. We just have to put the vector of logical values where the row index of the data object is.

```{r}
df2[df2$pop > 30, ]
```

We can apply this conditional logic to selecting columns as well. We use the `|` OR operator.

```{r collapse=FALSE}
names(df2) == "pop"

names(df2) == "gdp_pct"

vars <- names(df2) == "pop" | names(df2) == "gdp_pct"

vars

df2[,vars]

# or just use the names of your variables directly
df2[,c("pop", "gdp_pct")]
```