ps3.Rmd

---
title: "Problem Set 3"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, fig.align = "center", include = FALSE) # set include = TRUE to see solutions in knitted RMarkdown
```

## MLB Batting Statistics

In this problem set, you will practice using the pipe `%>%` and grouped calculations with batting statistics taken from the Lahman dataset. We will compute the same statistics as we did in [Problem Set 2](ps2.html) and also standardize these statistics within year, year and league, and also historical era.

To load the batting data into R, we can do the following:

```{r batting load, include=TRUE}
library(tidyverse)
library(Lahman)
batting <- as_tibble(Batting)
```

Unfortunately, some statistics like hit-by-pitch (HBP) were not recorded in the earlier decades of baseball. 
In the tbl, these missing values are designated `NA`. To see some of these, we can run the following code:

```{r batting select, include=TRUE}
batting %>% 
  select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)
```

A common convention for dealing with the missing values when computing $\text{PA}$ is to replace the `NA` with a `0`. To do this, we can use the function `replace_na()` within a pipe as follows. 

```{r batting NA, include=TRUE}
batting <- 
  batting %>% 
  replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
```

The syntax for `replace_na()` is a bit involved but the basic idea is you have to specify the value you want to replace each `NA`.
When using `replace_na()` it is very important to remember to include the `list(...)` bit. 

1. Load the Lahman data and run the above code to create the tbl `Batting`, which has replaced all of the `NA`' in the columns `IBB, HBP, SH, SF` and GIDP`. 

```{r pt1 q1 solution}
# solution
library(tidyverse)
library(Lahman)
batting <- as_tibble(Batting)

batting %>% 
  select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)

batting <- 
  batting %>% 
  replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
```

2. Using the pipe `%>%`, `mutate()`, `filter()`, and `select()`, create a tbl `batting` by:

    * Adding columns (with `mutate()`) for plate appearances (PA), unintentional walks (uBB), singles (X1B), batting average (BA), on-base percentage (OBP), on-base plus slugging (OPS), and weighted On-Base Average (wOBA). Note that the formula for plate appearances is $\text{PA} = \text{AB} + \text{BB} + \text{HBP} + \text{SH} + \text{SF}.$ Formulae for the remaining statistics are given in [Problem Set 2](ps2.html).
    * Pulling out only those rows of players with at least 502 plate appearances **and** who played in either the AL or NL with `filter()`
    * Select the following columns: playerID, yearID, lgID, teamID, PA, BA, OBP, OPS, wOBA. <p/>

```{r pt1 q2 solution}
# solution
batting <-
  batting %>%
  replace_na(list(HBP = 0, SH = 0, SF = 0, IBB = 0)) %>%
  mutate(PA = AB + BB + HBP + SH + SF,
         X1B = H - X2B - X3B - HR,
         uBB = BB - IBB,
         BA = H/AB,
         OBP = (H + BB + HBP)/(AB + BB + HBP + SF),
         SLG = (X1B + 2*X2B + 3*X3B + 4*HR)/AB,
         OPS = OBP + SLG,
         wOBA = (0.687 * uBB + 0.718 * HBP + 0.81 * X1B + 1.256 * X2B + 1.594 * X3B+ 2.065 * HR)/
           (AB + uBB + SF + HBP)
         ) %>%
  filter(lgID %in% c('AL', 'NL') & PA >= 502.2) %>% 
  select(playerID, yearID, teamID, lgID, PA,BA, OBP, OPS, wOBA)
```


3. Standardize each of BA, OBP, OPS, and wOBA using data from all of the years. Name the columns containing these new standardized values `zBA_all`, `zOPB_all`, etc. *Remember, you must re-define the `standardize()` function we wrote in [Lecture 3](lecture3.html)* Who were the best and worst batters according to these four metrics?

```{r pt1 q3 solution}
# solution
standardize <- function(x){
  mu <- mean(x, na.rm = TRUE)
  sigma <- sd(x, na.rm = TRUE)
  return( (x- mu)/sigma)
}

batting <- batting %>%
  mutate(zBA_all = standardize(BA),
         zOBP_all = standardize(OBP),
         zOPS_all = standardize(OPS),
         zwOBA_all = standardize(wOBA))

batting %>% 
  arrange(zBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zBA_all)) %>%
  head(5)

batting %>% 
  arrange(zOBP_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOBP_all)) %>%
  head(5)

batting %>% 
  arrange(zOPS_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOPS_all)) %>%
  head(5)

batting %>% 
  arrange(zwOBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zwOBA_all)) %>%
  head(5)
```

4. Group `batting` by year and compute the standardized BA, OBP, OPS, and wOBA within each year. Name the columns containing these new standardized values `zBA_year`, `zOBP_year`, etc. Now who are the best and worst batters according to the four measures?

```{r pt1 q4 solution}
# solution
batting <- batting %>%
  group_by(yearID) %>%
  mutate(zBA_year = standardize(BA),
         zOBP_year = standardize(OBP),
         zOPS_year = standardize(OPS),
         zwOBA_year = standardize(wOBA))

batting %>% 
  arrange(zBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zBA_all)) %>%
  head(5)

batting %>% 
  arrange(zOBP_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOBP_all)) %>%
  head(5)

batting %>% 
  arrange(zOPS_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOPS_all)) %>%
  head(5)

batting %>% 
  arrange(zwOBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zwOBA_all)) %>%
  head(5)
```


5. Remove the grouping by year and instead group by year and league. Once again, standardize OBP, OPS, and wOBA within each league-year combination, and name the columns containing these new standardized values `zBA_year_lg`, `zOBP_year_lg`, etc. Are the best and worst batters still the same?

```{r pt1 q5 solution}
# solution
batting <- 
  batting %>%
  ungroup() %>% 
  group_by(yearID, lgID) %>%
  mutate(zBA_year_lg = standardize(BA),
         zOBP_year_lg = standardize(OBP),
         zOPS_year_lg = standardize(OPS),
         zwOBA_year_lg = standardize(wOBA))

batting %>% 
  arrange(zBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zBA_all)) %>%
  head(5)

batting %>% 
  arrange(zOBP_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOBP_all)) %>%
  head(5)

batting %>% 
  arrange(zOPS_all) %>%
  head(5)

batting %>% 
  arrange(desc(zOPS_all)) %>%
  head(5)

batting %>% 
  arrange(zwOBA_all) %>%
  head(5)

batting %>% 
  arrange(desc(zwOBA_all)) %>%
  head(5)
```

6. Remove the grouping you created in Problem 4. Bill James [divided baseball history into several eras](https://www.billjamesonline.com/dividing_baseball_history_into_eras/) as follows:
    * Pioneer Era: 1871 -- 1892
    * Spitball Era: 1893 -- 1919
    * Landis Era: 1920 -- 1946
    * Baby Boomer Era: 1947 -- 1968
    * Artifical Turf Era: 1969 -- 1992
    * Camden Yards Era: 1993 -- present <p/> 
    
    Use `mutate()` and `case_when()` (just like we did in [Lecture 2](lecture2.html)) to add a column called Hist_era to `batting` that records the historical era.

```{r pt1 q6 solution}
# solution
batting <- batting %>%
  ungroup() %>%
  mutate(HIST_ERA = case_when(
    1871 <= yearID & yearID <= 1892 ~ "Pioneer",
    1893 <= yearID & yearID <= 1919 ~ "Spitball",
    1920 <= yearID & yearID <= 1946 ~ "Landis",
    1947 <= yearID & yearID <= 1968 ~ "Baby Boomer",
    1969 <= yearID & yearID <= 1992 ~ "Artifical Turf",
    1993 <= yearID ~ "Camden Yards"))
```

7. Group `batting` by Hist_era and standardize BA, OBP, OPS, and wOBA within historical era. Who are the best and worst batters now? Name the columns containing these new standardized values `zBA_hist`, `zOBP_hist`, etc.

```{r pt1 q7 solution}
# solution
batting <- batting %>%
  group_by(HIST_ERA) %>%
  mutate(zBA_hist = standardize(BA),
         zOBP_hist = standardize(OBP),
         zOPS_hist = standardize(OPS),
         zwOBA_hist = standardize(wOBA))

batting %>% 
  arrange(zBA_hist) %>%
  head(5)

batting %>% 
  arrange(desc(zBA_hist)) %>%
  head(5)

batting %>% 
  arrange(zOBP_hist) %>%
  head(5)

batting %>% 
  arrange(desc(zOBP_hist)) %>%
  head(5)

batting %>% 
  arrange(zOPS_hist) %>%
  head(5)

batting %>% 
  arrange(desc(zOPS_hist)) %>%
  head(5)

batting %>% 
  arrange(zwOBA_hist) %>%
  head(5)

batting %>% 
  arrange(desc(zwOBA_hist)) %>%
  head(5)
```

8. Remove the grouping you added in Problem 6.

```{r pt1 q8 solution}
# solution
batting <- 
  batting %>%
  ungroup()
```


## MLB Payroll and Winnings

Recall from [Problem Set 2](ps2.html), we plotted the relative payroll of MLB teams against their winning percentage. 
In that problem set, we read in a file that had included the relative payroll for each team as a separate column.
To get some additional practice with dplyr, we will read in a different dataset and re-compute these relative payrolls.

1. Remember the function we wrote in [Lecture 3](lecture3.html) to standardize various statistics? It is reproduced below:

```{r standardize fn, include=TRUE}
standardize <- function(x){
  mu <- mean(x, na.rm = TRUE)
  sigma <- sd(x, na.rm = TRUE)
  return( (x - mu)/sigma)
}
```

* We need to write another function in order to compute "relative payroll". This function will take in a vector `x`, compute its median, and then divides every element of `x` by the median.

```{r relative fn}
# solution
relative <- function(x){
  med <- median(x, na.rm = TRUE)
  return( x/med )  
}
```


2. Read in the [MLB Payroll Data](data/mlb_payrolls.csv) and load it into a tibble called `mlb_payrolls`.

```{r pt2 q2 solution}
# solution
mlb_payrolls <- read_csv(file = "data/mlb_payrolls.csv")
```

3. Using the pipe `%>%`, `group_by()`, and `mutate()`, add a column to `mlb_payrolls` that contains the relative payroll for each team.

```{r pt2 q3 solution}
# solution
mlb_payrolls <- 
  mlb_payrolls %>%
  group_by(Year) %>%
  mutate(Relative_Payroll = relative(Team_Payroll))
```

4. Make a scatterplot of winning percentage against relative payrolls. Comment on the relationship. Your scatterplot should be identical to one you made in [Problem Set 2](ps2.html).

```{r pt2 q4 solution}
# solution
ggplot(data = mlb_payrolls) +
  geom_point(mapping = aes(x = Relative_Payroll, y = Winning_Percentage))
```

5. Using the `summarize()` function, compute the average team payroll and relative payroll for each year. Save these results in a new tbl called `payroll_avg`. 

```{r pt2 q5 solution}
# solution
payroll_avg <-
  mlb_payrolls %>% 
  summarize(avg_team_payroll = mean(Team_Payroll), 
            avg_team_relative_payroll = mean(Relative_Payroll))
```


6. Make a scatterplot that shows how team payrolls have evolved over the year. Similar to what we did in [Lecture 3](lecture3.html), add a line to this scatterplot that shows the average team payroll. Do the same thing for relative payroll. What do you notice about the average team payroll and relative payroll? 

```{r pt2 q6 solution}
# solution
ggplot(data = mlb_payrolls) +
  geom_point(mapping = aes(x = Year, y = Team_Payroll)) +
  geom_line(data = payroll_avg, mapping = aes(x = Year, y = avg_team_payroll), col = 'red')

ggplot(data = mlb_payrolls) +
  geom_point(mapping = aes(x = Year, y = Relative_Payroll)) +
  geom_line(data = payroll_avg, mapping = aes(x = Year, y = avg_team_relative_payroll), col = 'red')
```


7. As you will see in coming lectures, correlation is a measure of the strength of the *linear* relationship between two variables. The closer to +1 or -1 the correlation between two variables is, the more predictable they are of each other. We can compute it using the `cor()` function. Using `summary()` and `cor()`, compute the correlation between relative payroll and winning percentage within each year. What do you notice about how the relationship between winning percentage and relative payroll changes year to year? 

```{r pt2 q7 solution}
# solution
mlb_payrolls %>%
  summarize(cor = cor(Winning_Percentage, Relative_Payroll))
```

## A Challenge Question

Without running the code, work with your teammates to see if you can figure out what the code below is doing.

```{r challenge Q, include=TRUE, eval=FALSE}
batting_2014_2015 <-
  batting %>%
  filter(yearID %in% c(2014, 2015)) %>%
  group_by(playerID) %>%
  filter(n() == 2) %>% 
  select(playerID, yearID, BA) %>%
  arrange(playerID)
```

Now, run the code above and save the tbl `batting_2014_2015.RData` to the file "data/batting_2014_2015.RData" using the `save()` function. We will return to this dataset in [Lecture 4](lecture4.html).

```{r challenge Q solution}
# solution
batting_2014_2015 <-
  batting %>%
  filter(yearID %in% c(2014, 2015)) %>%
  group_by(playerID) %>%
  filter(n() == 2) %>% 
  select(playerID, yearID, BA) %>%
  arrange(playerID)
#save(batting_2014_2015, file = 'data/batting_2014_2015.RData')
```