-
Notifications
You must be signed in to change notification settings - Fork 0
/
cfbscrapR_tutorial.Rmd
216 lines (170 loc) · 9.12 KB
/
cfbscrapR_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "Rolling EPA Graph"
output: github_document
editor_options:
chunk_output_type: console
---
```{r}
library(tidyverse)
library(cfbscrapR)
library(zoo)
library(ggimage)
```
# Pull the play by play data
In the interest of time, I’m going to avoid a fresh call to the API here and I’m going to use a copy of the play by play data that is saved on github. This method only takes about 30 seconds. The result will be a dataframe that contains ALL of the play by play data for ALL teams in 2020.
```{r}
t <- tempfile()
download.file("https://raw.githubusercontent.com/saiemgilani/cfbscrapR-data/master/data/rds/pbp_players_pos_2020.rds",t)
pbp_2020 <- readRDS(t)
```
# Pull team info and team colors
We can use the `cfb_team_info()` function to pull information about each FBS school including their logo, color, and abbreviation. We’re going to take a subset of the cfb_team_info data and clean it up for use in our graphs later on.
```{r}
team_info = cfb_team_info(year = 2020)
team_colors_logos = team_info %>%
select(school, abbreviation, color, logos, alt_color) %>%
unnest(logos) %>%
group_by(school) %>%
slice(1) %>%
ungroup()
```
# Basic overview of all FBS Offenses
Let’s create a basic overview of offensive EPA per play. We’ll start by creating a dataframe listing all 130 FBS teams from best to worst based on their average EPA per play across all games played.
```{r}
off_epa = pbp_2020 %>%
filter(rush == 1 | pass == 1) %>%
group_by(offense_play, offense_conference) %>%
summarize(off_epa = mean(EPA, na.rm = TRUE)) %>%
arrange(desc(off_epa)) %>%
rename(Team = offense_play) %>%
filter(!is.na(offense_conference)) %>%
ungroup() %>%
mutate(Rank = row_number()) %>%
mutate(TeamRank = paste0(Team, " #", Rank))
head(off_epa)
```
## Find best QB
Matt's addition:
If I take the best EPA pass offenses, then I should be able to find the best QBs in college.
```{r}
pass_epa = pbp_2020 %>%
filter(pass == 1) %>%
group_by(offense_play, offense_conference) %>%
summarize(pass_epa = mean(EPA, na.rm = TRUE)) %>%
arrange(desc(pass_epa)) %>%
rename(Team = offense_play) %>%
filter(!is.na(offense_conference)) %>%
ungroup() %>%
mutate(Rank = row_number()) %>%
mutate(TeamRank = paste0(Team, " #", Rank))
head(pass_epa)
```
# Visualize
Ok, now let’s start to visualize the data beyond the top few schools. Let’s graph all the teams that have a positive EPA on the year.
```{r}
off_epa %>%
filter(off_epa > 0) %>%
ggplot(aes(x = reorder(TeamRank, off_epa), y=off_epa)) +
geom_point(size = 3) +
coord_flip() +
theme_bw() +
ylab("Average EPA Per Play") + xlab("") +
labs(title = "Offensive EPA Per Play | Positive EPA Teams",
caption = "Chart by @cfbNate
Data from @CFB_Data via @cfbscrapr")
```
# Prepare the data for the moving average chart
So far, so good, but we came here to build a moving average chart! Let’s proceed by defining the team of interest and the moving average window by editing the cell block below. I like to use 100 plays when looking at a full season to smooth out as much of the variability as possible, but for a mid-season view 50 plays or even 25 plays might be more appropriate. Play around!
```{r}
team = "Memphis"
ma_plays = 100
```
Now that we have defined our team of interest, we can prepare the data for this specific team. We’re going to create a dataframe called team_off that is subsetted from our full season play by play data. This new dataframe will be specific to our team of interest and it will add fields for the moving average and the play count.
```{r}
team_off = pbp_2020 %>%
filter(offense_play == team) %>%
filter(rush == 1 | pass == 1) %>%
filter(!is.na(EPA)) %>%
mutate(cu_epa=cummean(EPA), #this field is not used in this vignette but it could be substituted later to graph the cumulative EPA
ma_epa=rollapply(EPA,ma_plays,mean,align='right',fill=NA),
play_count = row_number(),
week_team = paste0("WK", ifelse(week > 9, week, paste0(0,week)), " ", defense_play))
```
We’re going to graph different background tiles for each opponent so we need to define where one opponent stops and another one begins.
```{r}
team_off_play_start = team_off %>%
group_by(week_team) %>%
slice(1) %>%
select(defense_play, week_team, play_count) %>%
rename(play_start = play_count,
team = defense_play)
team_off_play_stop = team_off %>%
group_by(week_team) %>%
filter(row_number() == n()) %>%
select(week_team, play_count) %>%
rename(play_stop = play_count)
```
We’re going to add the opponent’s logo in the middle of each background tile, so we need to define the midpoint of each background tile as well.
```{r}
team_off_start_stop = team_off_play_start %>%
left_join(team_off_play_stop, by = "week_team") %>%
mutate(midpoint = (play_start + play_stop)/2)
```
We need to make a simple vector that says how many total plays are in our `team_off` dataframe.
```{r}
play_count = max(team_off$play_count)
```
Next we are going to update our background tile dataframe to include the color info that we pulled earlier. FCS color data is not available, so we will default those to gray.
```{r}
team_off_start_stop = team_off_start_stop %>%
left_join(team_colors_logos, by = c("team" = "school")) %>%
mutate(color = replace_na(color,"gray")) %>%
select(team, week_team, play_start, play_stop, midpoint, color)
```
Are we ready to graph yet? Just a few more steps! We need a named vector to get the right colors on those background tiles.
```{r}
team_colors <- as.character(team_off_start_stop$color)
names(team_colors) <- as.character(team_off_start_stop$team)
```
Now we need to pull in an alternate source of team logos that includes FCS schools. Then we’ll join that new table of logos with our background tile dataframe.
```{r}
all_logos = read_csv("https://raw.githubusercontent.com/natemanzo/cfb_data/master/_team_logos.csv")
team_off_start_stop = team_off_start_stop %>%
left_join(all_logos, by = c("team" = "school"))
```
Almost done! The last thing we need is your signature to go in the caption of the graph. You can use your twitter handle or your real name here.
```{r}
signature = "@flahermi"
```
# IT’S GRAPHIN’ TIME
The beginning of the line graph will depend on the ma_plays variable that you defined earlier. Dashed lines indicating the EPA/play of the 25th percentile, median, 75th percentile, and top and bottom teams in the country are drawn on the graph to show context for how the team of interest compares to the rest of the country.
```{r}
graph_team_off = ggplot() +
geom_rect(data = team_off_start_stop, aes(xmin = play_start, xmax = play_stop, fill = team, ymin = -.5, ymax = .9), color = "gray90") +
geom_rect(data = team_off_start_stop, aes(xmin = play_start, xmax = play_stop, ymin = .8, ymax = 1), color = "gray90", fill = "white") +
scale_fill_manual(values = team_colors) +
geom_hline(yintercept = quantile(off_epa$off_epa), linetype = 2, color = "gray20", alpha = .8) +
geom_hline(yintercept = 0, linetype = 1, color = "gray20", alpha = .2) +
geom_image(data=team_off_start_stop,aes(x=midpoint,y=.9,image=logo), asp = 16/9, size = .05) +
annotate(x = -2, y = quantile(off_epa$off_epa)[1], geom = "text", size = 3, hjust = "right", vjust = 0, label = off_epa %>% slice(n()) %>% pull(Team)) +
annotate(x = -2, y = quantile(off_epa$off_epa)[2], geom = "text", size = 3, hjust = "right", vjust = 0, label = "25%ile") +
annotate(x = -2, y = quantile(off_epa$off_epa)[3], geom = "text", size = 3, hjust = "right", vjust = 0, label = "Median") +
annotate(x = -2, y = quantile(off_epa$off_epa)[4], geom = "text", size = 3, hjust = "right", vjust = 0, label = "75%ile") +
annotate(x = -2, y = quantile(off_epa$off_epa)[5], geom = "text", size = 3, hjust = "right", vjust = 0, label = off_epa %>% slice(1) %>% pull(Team)) +
geom_line(data = team_off, aes(x = play_count, y = ma_epa), color = "white", size = 2) +
geom_line(data = team_off, aes(x = play_count, y = ma_epa), size = 1.25) +
theme_minimal() + theme(panel.grid = element_blank()) + theme(legend.position = "none") +
ylab("EPA") + xlab("Number of Plays") +
labs(title = paste0(team," Offensive EPA | ",ma_plays,"-Play Moving Average"),
caption = paste0("Chart by ",signature," using code from @cfbNate
Data from @CFB_Data via @cfbscrapr")) +
coord_cartesian(xlim = c(-20, play_count), # This leaves room for the labels over the dashed lines
clip = 'off') # This keeps the labels from disappearing
graph_team_off
```
Lastly, you can save your graph to your working directory by running the line below. PC users will notice the saved graphic will have a higher resolution and less pixelization than the graphic generated above.
Reminder: You can always use `getwd()` to find out what your current working directory is if you’re not sure.
```{r}
ggsave(graph_team_off, filename = paste0("off_epa_rolling_",team,".png"),
dpi = 300, type = "cairo", width = 10, height = 7, units = "in")
```