-
Notifications
You must be signed in to change notification settings - Fork 7
/
ps3.Rmd
413 lines (320 loc) · 12.6 KB
/
ps3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
title: "Problem Set 3"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, fig.align = "center", include = FALSE) # set include = TRUE to see solutions in knitted RMarkdown
```
## MLB Batting Statistics
In this problem set, you will practice using the pipe `%>%` and grouped calculations with batting statistics taken from the Lahman dataset. We will compute the same statistics as we did in [Problem Set 2](ps2.html) and also standardize these statistics within year, year and league, and also historical era.
To load the batting data into R, we can do the following:
```{r batting load, include=TRUE}
library(tidyverse)
library(Lahman)
batting <- as_tibble(Batting)
```
Unfortunately, some statistics like hit-by-pitch (HBP) were not recorded in the earlier decades of baseball.
In the tbl, these missing values are designated `NA`. To see some of these, we can run the following code:
```{r batting select, include=TRUE}
batting %>%
select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)
```
A common convention for dealing with the missing values when computing $\text{PA}$ is to replace the `NA` with a `0`. To do this, we can use the function `replace_na()` within a pipe as follows.
```{r batting NA, include=TRUE}
batting <-
batting %>%
replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
```
The syntax for `replace_na()` is a bit involved but the basic idea is you have to specify the value you want to replace each `NA`.
When using `replace_na()` it is very important to remember to include the `list(...)` bit.
1. Load the Lahman data and run the above code to create the tbl `Batting`, which has replaced all of the `NA`' in the columns `IBB, HBP, SH, SF` and GIDP`.
```{r pt1 q1 solution}
# solution
library(tidyverse)
library(Lahman)
batting <- as_tibble(Batting)
batting %>%
select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)
batting <-
batting %>%
replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
```
2. Using the pipe `%>%`, `mutate()`, `filter()`, and `select()`, create a tbl `batting` by:
* Adding columns (with `mutate()`) for plate appearances (PA), unintentional walks (uBB), singles (X1B), batting average (BA), on-base percentage (OBP), on-base plus slugging (OPS), and weighted On-Base Average (wOBA). Note that the formula for plate appearances is $\text{PA} = \text{AB} + \text{BB} + \text{HBP} + \text{SH} + \text{SF}.$ Formulae for the remaining statistics are given in [Problem Set 2](ps2.html).
* Pulling out only those rows of players with at least 502 plate appearances **and** who played in either the AL or NL with `filter()`
* Select the following columns: playerID, yearID, lgID, teamID, PA, BA, OBP, OPS, wOBA. <p/>
```{r pt1 q2 solution}
# solution
batting <-
batting %>%
replace_na(list(HBP = 0, SH = 0, SF = 0, IBB = 0)) %>%
mutate(PA = AB + BB + HBP + SH + SF,
X1B = H - X2B - X3B - HR,
uBB = BB - IBB,
BA = H/AB,
OBP = (H + BB + HBP)/(AB + BB + HBP + SF),
SLG = (X1B + 2*X2B + 3*X3B + 4*HR)/AB,
OPS = OBP + SLG,
wOBA = (0.687 * uBB + 0.718 * HBP + 0.81 * X1B + 1.256 * X2B + 1.594 * X3B+ 2.065 * HR)/
(AB + uBB + SF + HBP)
) %>%
filter(lgID %in% c('AL', 'NL') & PA >= 502.2) %>%
select(playerID, yearID, teamID, lgID, PA,BA, OBP, OPS, wOBA)
```
3. Standardize each of BA, OBP, OPS, and wOBA using data from all of the years. Name the columns containing these new standardized values `zBA_all`, `zOPB_all`, etc. *Remember, you must re-define the `standardize()` function we wrote in [Lecture 3](lecture3.html)* Who were the best and worst batters according to these four metrics?
```{r pt1 q3 solution}
# solution
standardize <- function(x){
mu <- mean(x, na.rm = TRUE)
sigma <- sd(x, na.rm = TRUE)
return( (x- mu)/sigma)
}
batting <- batting %>%
mutate(zBA_all = standardize(BA),
zOBP_all = standardize(OBP),
zOPS_all = standardize(OPS),
zwOBA_all = standardize(wOBA))
batting %>%
arrange(zBA_all) %>%
head(5)
batting %>%
arrange(desc(zBA_all)) %>%
head(5)
batting %>%
arrange(zOBP_all) %>%
head(5)
batting %>%
arrange(desc(zOBP_all)) %>%
head(5)
batting %>%
arrange(zOPS_all) %>%
head(5)
batting %>%
arrange(desc(zOPS_all)) %>%
head(5)
batting %>%
arrange(zwOBA_all) %>%
head(5)
batting %>%
arrange(desc(zwOBA_all)) %>%
head(5)
```
4. Group `batting` by year and compute the standardized BA, OBP, OPS, and wOBA within each year. Name the columns containing these new standardized values `zBA_year`, `zOBP_year`, etc. Now who are the best and worst batters according to the four measures?
```{r pt1 q4 solution}
# solution
batting <- batting %>%
group_by(yearID) %>%
mutate(zBA_year = standardize(BA),
zOBP_year = standardize(OBP),
zOPS_year = standardize(OPS),
zwOBA_year = standardize(wOBA))
batting %>%
arrange(zBA_all) %>%
head(5)
batting %>%
arrange(desc(zBA_all)) %>%
head(5)
batting %>%
arrange(zOBP_all) %>%
head(5)
batting %>%
arrange(desc(zOBP_all)) %>%
head(5)
batting %>%
arrange(zOPS_all) %>%
head(5)
batting %>%
arrange(desc(zOPS_all)) %>%
head(5)
batting %>%
arrange(zwOBA_all) %>%
head(5)
batting %>%
arrange(desc(zwOBA_all)) %>%
head(5)
```
5. Remove the grouping by year and instead group by year and league. Once again, standardize OBP, OPS, and wOBA within each league-year combination, and name the columns containing these new standardized values `zBA_year_lg`, `zOBP_year_lg`, etc. Are the best and worst batters still the same?
```{r pt1 q5 solution}
# solution
batting <-
batting %>%
ungroup() %>%
group_by(yearID, lgID) %>%
mutate(zBA_year_lg = standardize(BA),
zOBP_year_lg = standardize(OBP),
zOPS_year_lg = standardize(OPS),
zwOBA_year_lg = standardize(wOBA))
batting %>%
arrange(zBA_all) %>%
head(5)
batting %>%
arrange(desc(zBA_all)) %>%
head(5)
batting %>%
arrange(zOBP_all) %>%
head(5)
batting %>%
arrange(desc(zOBP_all)) %>%
head(5)
batting %>%
arrange(zOPS_all) %>%
head(5)
batting %>%
arrange(desc(zOPS_all)) %>%
head(5)
batting %>%
arrange(zwOBA_all) %>%
head(5)
batting %>%
arrange(desc(zwOBA_all)) %>%
head(5)
```
6. Remove the grouping you created in Problem 4. Bill James [divided baseball history into several eras](https://www.billjamesonline.com/dividing_baseball_history_into_eras/) as follows:
* Pioneer Era: 1871 -- 1892
* Spitball Era: 1893 -- 1919
* Landis Era: 1920 -- 1946
* Baby Boomer Era: 1947 -- 1968
* Artifical Turf Era: 1969 -- 1992
* Camden Yards Era: 1993 -- present <p/>
Use `mutate()` and `case_when()` (just like we did in [Lecture 2](lecture2.html)) to add a column called Hist_era to `batting` that records the historical era.
```{r pt1 q6 solution}
# solution
batting <- batting %>%
ungroup() %>%
mutate(HIST_ERA = case_when(
1871 <= yearID & yearID <= 1892 ~ "Pioneer",
1893 <= yearID & yearID <= 1919 ~ "Spitball",
1920 <= yearID & yearID <= 1946 ~ "Landis",
1947 <= yearID & yearID <= 1968 ~ "Baby Boomer",
1969 <= yearID & yearID <= 1992 ~ "Artifical Turf",
1993 <= yearID ~ "Camden Yards"))
```
7. Group `batting` by Hist_era and standardize BA, OBP, OPS, and wOBA within historical era. Who are the best and worst batters now? Name the columns containing these new standardized values `zBA_hist`, `zOBP_hist`, etc.
```{r pt1 q7 solution}
# solution
batting <- batting %>%
group_by(HIST_ERA) %>%
mutate(zBA_hist = standardize(BA),
zOBP_hist = standardize(OBP),
zOPS_hist = standardize(OPS),
zwOBA_hist = standardize(wOBA))
batting %>%
arrange(zBA_hist) %>%
head(5)
batting %>%
arrange(desc(zBA_hist)) %>%
head(5)
batting %>%
arrange(zOBP_hist) %>%
head(5)
batting %>%
arrange(desc(zOBP_hist)) %>%
head(5)
batting %>%
arrange(zOPS_hist) %>%
head(5)
batting %>%
arrange(desc(zOPS_hist)) %>%
head(5)
batting %>%
arrange(zwOBA_hist) %>%
head(5)
batting %>%
arrange(desc(zwOBA_hist)) %>%
head(5)
```
8. Remove the grouping you added in Problem 6.
```{r pt1 q8 solution}
# solution
batting <-
batting %>%
ungroup()
```
## MLB Payroll and Winnings
Recall from [Problem Set 2](ps2.html), we plotted the relative payroll of MLB teams against their winning percentage.
In that problem set, we read in a file that had included the relative payroll for each team as a separate column.
To get some additional practice with dplyr, we will read in a different dataset and re-compute these relative payrolls.
1. Remember the function we wrote in [Lecture 3](lecture3.html) to standardize various statistics? It is reproduced below:
```{r standardize fn, include=TRUE}
standardize <- function(x){
mu <- mean(x, na.rm = TRUE)
sigma <- sd(x, na.rm = TRUE)
return( (x - mu)/sigma)
}
```
* We need to write another function in order to compute "relative payroll". This function will take in a vector `x`, compute its median, and then divides every element of `x` by the median.
```{r relative fn}
# solution
relative <- function(x){
med <- median(x, na.rm = TRUE)
return( x/med )
}
```
2. Read in the [MLB Payroll Data](data/mlb_payrolls.csv) and load it into a tibble called `mlb_payrolls`.
```{r pt2 q2 solution}
# solution
mlb_payrolls <- read_csv(file = "data/mlb_payrolls.csv")
```
3. Using the pipe `%>%`, `group_by()`, and `mutate()`, add a column to `mlb_payrolls` that contains the relative payroll for each team.
```{r pt2 q3 solution}
# solution
mlb_payrolls <-
mlb_payrolls %>%
group_by(Year) %>%
mutate(Relative_Payroll = relative(Team_Payroll))
```
4. Make a scatterplot of winning percentage against relative payrolls. Comment on the relationship. Your scatterplot should be identical to one you made in [Problem Set 2](ps2.html).
```{r pt2 q4 solution}
# solution
ggplot(data = mlb_payrolls) +
geom_point(mapping = aes(x = Relative_Payroll, y = Winning_Percentage))
```
5. Using the `summarize()` function, compute the average team payroll and relative payroll for each year. Save these results in a new tbl called `payroll_avg`.
```{r pt2 q5 solution}
# solution
payroll_avg <-
mlb_payrolls %>%
summarize(avg_team_payroll = mean(Team_Payroll),
avg_team_relative_payroll = mean(Relative_Payroll))
```
6. Make a scatterplot that shows how team payrolls have evolved over the year. Similar to what we did in [Lecture 3](lecture3.html), add a line to this scatterplot that shows the average team payroll. Do the same thing for relative payroll. What do you notice about the average team payroll and relative payroll?
```{r pt2 q6 solution}
# solution
ggplot(data = mlb_payrolls) +
geom_point(mapping = aes(x = Year, y = Team_Payroll)) +
geom_line(data = payroll_avg, mapping = aes(x = Year, y = avg_team_payroll), col = 'red')
ggplot(data = mlb_payrolls) +
geom_point(mapping = aes(x = Year, y = Relative_Payroll)) +
geom_line(data = payroll_avg, mapping = aes(x = Year, y = avg_team_relative_payroll), col = 'red')
```
7. As you will see in coming lectures, correlation is a measure of the strength of the *linear* relationship between two variables. The closer to +1 or -1 the correlation between two variables is, the more predictable they are of each other. We can compute it using the `cor()` function. Using `summary()` and `cor()`, compute the correlation between relative payroll and winning percentage within each year. What do you notice about how the relationship between winning percentage and relative payroll changes year to year?
```{r pt2 q7 solution}
# solution
mlb_payrolls %>%
summarize(cor = cor(Winning_Percentage, Relative_Payroll))
```
## A Challenge Question
Without running the code, work with your teammates to see if you can figure out what the code below is doing.
```{r challenge Q, include=TRUE, eval=FALSE}
batting_2014_2015 <-
batting %>%
filter(yearID %in% c(2014, 2015)) %>%
group_by(playerID) %>%
filter(n() == 2) %>%
select(playerID, yearID, BA) %>%
arrange(playerID)
```
Now, run the code above and save the tbl `batting_2014_2015.RData` to the file "data/batting_2014_2015.RData" using the `save()` function. We will return to this dataset in [Lecture 4](lecture4.html).
```{r challenge Q solution}
# solution
batting_2014_2015 <-
batting %>%
filter(yearID %in% c(2014, 2015)) %>%
group_by(playerID) %>%
filter(n() == 2) %>%
select(playerID, yearID, BA) %>%
arrange(playerID)
#save(batting_2014_2015, file = 'data/batting_2014_2015.RData')
```