-
Notifications
You must be signed in to change notification settings - Fork 91
/
Copy path09-wt-aggregate-data.Rmd
600 lines (459 loc) · 34.3 KB
/
09-wt-aggregate-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
# Walkthrough 3: Using school-level aggregate data to illuminate educational inequities {#c09}
**Abstract**
This chapter explores cleaning, tidying, joining, and visualizing publicly available, aggregate data. Data scientists in education frequently work with public aggregate data when student-level data is not available. By working with these data, they can discover broader trends and underlying patterns. If aggregate data is disaggregated by subgroups or subpopulations, data scientists can reveal areas of inequity for marginalized populations.
This chapter uses a freely available school district dataset to examine the distribution of students in the district by race and socioeconomic status. Subgroup analysis can reveal the state of equity in a system and inform student improvement. Data science techniques in this chapter include reading tables from an online PDF into a machine-readable format, preparing data for analysis, transforming it into a tidy format, visualizing it, and analyzing distributions and relationships.
## Topics emphasized
- Importing data
- Tidying data
- Transforming data
- Visualizing data
## Functions introduced
- `tabulapdf::extract_tables()`
- `readRDS()`
- `purrr::map`
- `purrr::list_rbind()`
- `purrr::set_names()`
- `dplyr::slice()`
- `dplyr::across()`
- `trimws()`
- `stringr::str_replace()`
- `janitor::adorn_totals()`
## Vocabulary
* aggregate data
* disaggregated data
* data frame
* Free/Reduced Price Lunch (FRPL)
* histogram
* lists
* subgroup
* trim
* weighted average
## Chapter overview
Data scientists in education don't always have access to student-level data, so knowing how to analyze aggregate datasets is very valuable. This chapter explores aggregate data and how to access, clean, and explore it. It is a "companion" to [Chapter 10](#c10), which also explores aggregate data but does so with an emphasis on "longitudinal analyses"---analyses that involve data at more than one point in time.
### Background
A common situation encountered when searching for education data, particularly by analysts who are not directly working with schools or districts, is the prevalence of publicly available aggregate data. Aggregate data is information that has the following characteristics:
1. collected from multiple sources and/or on multiple measures, variables, or individuals and
2. compiled into data summaries or summary reports, typically for the purposes of public reporting or statistical analysis [@greatschools2014]
Examples of publicly available aggregate data include school-level graduation rates, state test proficiency scores by grade and subject, or mean survey responses. In this walkthrough, you'll explore the role of aggregate data with a focus on educational equity.
Aggregate data is essential both for accountability and for providing useful information about schools and districts. For example, district administrators might aggregate student-level enrollment reports over time. They can use this information to see how many students enroll in each school, in the district overall, and at various grade levels. Depending on the state, the district administrator might submit these aggregate data to their state education agency (SEA) for reporting purposes. These datasets might be posted on the state's Department of Education website for anyone to download and use.
Federal and international education datasets provide additional information. In the US, some federal datasets aim to consolidate important metrics from all states. These consolidated reports can be useful because each state has its own repository of data. Going through each state's website to download a particular metric is a significant effort. However, the federal datasets often have more stringent data requirements than the states, so the datasets may be less usable.
Data scientists in education can analyze these reports and datasets to answer questions related to their field of interest, but doing so is not always straightforward. Publicly available aggregate datasets are large and often suppressed to protect privacy. Because of their coarseness, they can be difficult to interpret and use. Sometimes, agencies delay release of the dataset by multiple years.
Generally, aggregate data is used to surface broader trends and patterns in education instead of diagnosing underlying issues or making causal statements. Data scientists must consider the limitations of aggregate data before analyzing it.
Despite these limitations, the wide availability of aggregated data has its benefits. Analysis of aggregate data helps educators identify patterns that are harder to see in individual-level data. Data scientists in education can use this new insight to create research questions, craft hypotheses, and make recommendations on how to improve education systems.
This chapter and the [following one](#c10) provide two examples of cleaning aggregate data and using these datasets to compare student experiences. In this chapter, you'll focus on educational equity by identifying and comparing patterns in student demographic groups. In the next chapter, you'll compare student counts over time in different states.
#### What is the difference between aggregate and student-level data?
Let’s explore the differences between aggregate and student-level data. Publicly available data---like the data you'll use in this walkthrough---summarizes student-level data. This means that student-level data is totaled to protect students' identities before being made publicly available. We can use R to demonstrate this concept.
Here are rows in an example student-level dataset:
```{r student level data example, message = F}
library(tidyverse)
# Create student-level data
tibble(
student = letters[1:10],
school = rep(letters[11:15], 2),
test_score = sample(0:100, 10, replace = TRUE)
)
```
Run the next code chunk, and you'll see that aggregate data totals up a variable---the variable `test_score` in this case---to "roll up" the student-level information. Each row of the resulting dataset represents a group. The group in this example is `school`, which you'll tell R to use by putting it in `group_by()`. Then, you'll calculate the average score by school with `summarize()` and `mean()`:
```{r aggregate level data example}
tibble(
student = letters[1:10],
school = rep(letters[11:15], 2),
test_score = sample(0:100, 10, replace = TRUE)
) %>%
# Aggregate by school
group_by(school) %>%
# Calculate average score by school
summarize(mean_score = mean(test_score))
```
Notice that this dataset no longer identifies individual students.
#### Disaggregating aggregated data
In this walkthrough, you'll encounter the term "disaggregated." In the context of the walkthrough, this means disaggregating to a smaller group but not to the individual level.
Aggregated data can tell you many things, but to better examine groups that share similar characteristics, you must have data disaggregated by those groups. This data is still aggregated at the row level but provides information on smaller components than the grand total [@disaggregate]. Common disaggregations for students include gender, race and ethnicity, socioeconomic status, English learner designation, and whether they are served under the Individuals with Disabilities Education Act (IDEA) [@subgroup].
#### Disaggregating data and equity
Disaggregated data is essential for understanding equity in educational resources and outcomes. Without disaggregating data by student characteristics, educators are unable to understand how different groups of students are doing and what support they need. With disaggregated data, educators can identify what solutions are required in order to solve disparities in opportunity, resources, and treatment.
## Data sources
There are many publicly available aggregate datasets related to education. On the international level, perhaps the most well-known is PISA:
- [Programme for International Student Assessment (PISA)](https://www.oecd.org/en/about/programmes/pisa.html) (https[]()://www.oecd.org/pisa/), which measures 15-year-old school pupils' academic performance in mathematics, science, and reading.
On the U.S. federal level, well-known examples include:
- [Civil Rights Data Collection (CRDC)](https://www2.ed.gov/about/offices/list/ocr/data.html) (https[]()://www2.ed.gov/about/offices/list/ocr/data.html), which reports variables on educational programs and services disaggregated by race and ethnicity, sex, limited English proficiency, and disability. These data are at the school level.
- [Common Core of Data (CCD)](https://nces.ed.gov/ccd/) (https[]()://www2.ed.gov/about/offices/list/ocr/data.html), which is the U.S. Department of Education's primary database on public elementary and secondary education.
- [EdFacts](https://www2.ed.gov/about/inits/ed/edfacts/data-files/index.html) (https[]()://www2.ed.gov/about/inits/ed/edfacts/data-files/index.html), which includes state assessments and adjusted cohort graduation rates. These data are at the school and district level.
- [Integrated Postsecondary Education Data System (IPEDS)](https://nces.ed.gov/ipeds/) (https[]()://nces.ed.gov/ipeds/), which is the U.S. Department of Education's primary database on postsecondary education.
- [National Assessment for Educational Progress (NAEP) Data](https://nces.ed.gov/nationsreportcard/researchcenter/datatools.aspx) (https[]()://nces.ed.gov/nationsreportcard/researchcenter/datatools.aspx), which is an assessment of educational progress in the United States. Often called the "nation's report card", the NAEP reading and mathematics assessments are administered to a representative sample of fourth- and eighth-grade students in each state every two years.
At the U.S. state and district levels, two examples include:
- [California Department of Education](https://www.cde.ca.gov/ds/) (https[]()://www.cde.ca.gov/ds/), which is the state's Department of Education website. It includes both downloadable CSV files and "Data Quest", which lets you query their data online.
- [Minneapolis Public Schools](https://www.mpschools.org/about-mps/data-dashboard) (https[]()://mpschools.org/about-mps/data-dashboard), which is a school district website that displays data at the school level.
### Selecting data
For this walkthrough, you'll examine data from a school district. In the next, you'll zoom out and examine data across states in the United States.
When we published the first edition of this book in 2020, the school district provided comprehensive student demographic data in PDF format. They reported both the percentage and exact number of students in each subgroup. This detailed information allowed a thorough examination of individual school demographics and offered an excellent opportunity to explore systemic inequities.
The district has since changed its reporting process to a school-level dashboard. The PDFs we used in the first edition are no longer available on the district's website. This change in availability and access is a common occurrence in education. We cannot assume that data will remain available in the same format or location, or remain available at all.
Adapting to changes in data availability is crucial for a data scientist in education. You can implement data governance measures in your organization, save backups in an organized manner in accordance with privacy and security policies, and create a centralized repository for data-related resources. To learn more, we recommend Data Management in Large-Scale Education Research by Crystal Lewis [@lewis2024data].
To ensure consistent access to the PDFs for this chapter, we saved copies in our [GitHub repository](https://github.com/data-edu/data-science-in-education/tree/master/data/agg_data) (https[]()://github.com/data-edu/data-science-in-education/tree/master/data/agg_data). You can verify their original locations on the district website with the Wayback Machine:
* [Race/ethnicity data](https://web.archive.org/web/20201112024712/https://studentaccounting.mpls.k12.mn.us/uploads/mps_fall2018_racial_ethnic_by_school_by_grade.pdf) (https[]()://web.archive.org/web/20201112024712/https[]()://studentaccounting.mpls.k12.mn.us/uploads/mps_fall2018_racial_ethnic_by_school_by_grade.pdf)
* [FRPL data](https://web.archive.org/web/20220307174824/https://studentaccounting.mpls.k12.mn.us/uploads/fall_2018_meal_eligiblity_official.pdf) (https[]()://web.archive.org/web/20220307174824/https[]()//studentaccounting.mpls.k12.mn.us/uploads/fall_2018_meal_eligiblity_official.pdf)
### Methods
In this chapter, you'll practice analyzing data from a single district to help education data practitioners describe the district's needs and opportunities. You'll use descriptive analysis on an aggregate dataset to find out whether there is a phenomenon present, what it is, and what leaders can address through resources, reforms, or interventions [@descriptive]. Descriptive analysis is different from causal analysis, which aims to assess the root cause of a phenomenon or the effects of an intervention.
First, you'll learn how to import PDFs into R, clean and wrangle data, and visualize it to gain insights about the district.
## Load packages
As in past chapters, begin your script by calling the necessary packages. If you have not installed these packages yet, see the ["Packages" section](#c06p) of the ["Foundational Skills"](#c06) chapter.
```{r, message = F, warning = F}
library(tidyverse)
library(here)
library(janitor)
library(dataedu)
```
ROpenSci created the [{tabulapdf}](https://github.com/ropensci/tabulapdf) [@sepulveda] package. It provides R bindings to the Tabula PDF Table Extractor Library, which can be used to computationally extract tables from PDF documents. It is not available on CRAN, so install it by running the code below:
```{r, eval = FALSE}
install.packages(
"tabulapdf",
repos = c(
"https://ropensci.r-universe.dev",
"https://cloud.r-project.org"
)
)
```
{rJava} [@R-rJava] is a required package to load {tabulapdf}. If you're unable to install {rJava} or would like to go straight to the data processing, use Option 2 or Option 3 below to import the data.
```{r, eval = FALSE}
library(tabulapdf)
```
## Import data
You have three options for extracting the data from the PDFs. Pick one option, run the code, and then follow the data workflow to clean and process the data.
Option 1: Use {tabulapdf}, which pulls the PDF tables into lists using `extract_tables()`. The PDFs are available in the book's [GitHub repository](https://github.com/data-edu/data-science-in-education/tree/master/data/agg_data) (https[]()://github.com/data-edu/data-science-in-education/tree/master/data/agg_data).
Option 2: Get the imported but unprocessed data from the book's [GitHub repository](https://github.com/data-edu/data-science-in-education/tree/master/data/agg_data) (https[]()://github.com/data-edu/data-science-in-education/tree/master/data/agg_data). They are available in the files `race_pdf.Rds` and `frpl_pdf.Rds`. To set up the folders in your working directory as they are in this book, first, create a folder called `"data"`. Then, inside that folder, create a second folder called `"agg_data"` and place the `race_pdf.Rds` and `frpl_pdf.Rds` files in the `"agg_data"` folder. Then, load the data using `here()`. Alternatively, if you are using a different working directory, change the file path inside of `here()` to match where you store your data in your working directory.
Option 3: Get the imported but unprocessed data from the {dataedu} package.
Begin importing the data with the Race/Ethnicity tables. Import the PDF information using one of the three options:
```{r, eval = FALSE}
# Option 1: Get data using {tabulapdf}
race_pdf <-
extract_tables(
"https://github.com/data-edu/data-science-in-education/blob/main/data/agg_data/mps_fall2018_racial_ethnic_by_school_by_grade.pdf?raw=true"
)
# Option 2: Get data from the book repository
# The code below assumes you have set up folders data and agg_data within your working directory
race_pdf <-
readRDS(here("data", "agg_data", "race_pdf.Rds"))
# Option 3: Get data using {dataedu}
race_pdf <-
dataedu::race_pdf
```
Next, convert what you've imported into a data frame. The object `race_pdf` is a list. Before analyzing the data, you need to transform the list into a data frame by following these steps:
1. Turn `race_pdf` into a tibble by using `map(as_tibble())` from the {purrr} package.
2. Use the `list_rbind()` function to turn these tibbles into a single data frame.
3. Use the `slice()` function to remove unnecessary rows.
4. Create readable column names using `set_names()`.
```{r, eval = FALSE}
race_df <-
race_pdf %>%
# Turn each page into a tibble
map( ~ as_tibble(.x, .name_repair = "unique")) %>%
# Make data frame
list_rbind() %>%
# Remove unnecessary rows
slice(-1:-2) %>%
# Use descriptive column names
set_names(
c(
"school_group",
"school_name",
"grade",
"na_num",
# Native American number of students
"na_pct",
# Native American percentage of students
"aa_num",
# African American number of students
"aa_pct",
# African American percentage
"as_num",
# Asian number of students
"as_pct",
# Asian percentage
"hi_num",
# Hispanic number of students
"hi_pct",
# Hispanic percentage
"wh_num",
# White number of students
"wh_pct",
# White percentage
"pi_pct",
# Pacific Islander percentage
"blank_col",
# Total number of students (from the Race PDF)
"tot"
)
)
```
For the Race/Ethnicity table, you want the totals for each district school because you won't be looking at grade-level variation. In the resulting data frame, the school totals have "Total" in `school_name`.
Next, clean the data further using these steps:
1. Remove unnecessary or blank columns using `select()`. Negative selections mean those columns will be removed.
2. Keep schools that have "Total" in the name but remove any rows that are "Grand Total" using `filter()`
3. Trim white space from character columns using `trimws()`.
4. The `percentage` columns contain percentage signs, meaning they were read in as characters. Turn the data in these columns into a numeric data type using `mutate()`.
Let's discuss that final step further: `mutate(across(matches("pct"), ~ as.numeric(str_replace(., "%", "")) / 100))`. Using this code, you are telling `mutate` to:
* Use `match()` with `across()` to select columns with names that match the regular expression "pct".
* For the rows in these columns, replace the character `"%"` with blanks `""` by using `str_replace(., "%", "")`.
* Make these rows numeric by using `as.numeric()`.
* Divide the values by 100 using `/100`.
```{r, eval = FALSE}
race_df2 <-
race_df %>%
# Remove unnecessary columns
select(-school_group, -grade, -pi_pct, -blank_col) %>%
# Remove the "Grand Total" to get grade-level numbers
filter(str_detect(school_name, "Total"), school_name != "Grand Total") %>%
# Clean up school names
mutate(school_name = str_replace(school_name, "Total", "")) %>%
# Trim white space
mutate(across(where(is.character), trimws)) %>%
# Turn percentage columns into numeric format
mutate(across(matches("pct"), ~ as.numeric(str_replace(., "%", "")) / 100))
```
After importing the Race/Ethnicity data, you’ll import the Free and Reduced-Price Lunch (FRPL) PDFs.
FRPL data are often used as a proxy for poverty in the U.S. [@frpl]. Students from households with incomes up to 185% of the poverty threshold are eligible for free or reduced-price lunch.
Definitions are very important for disaggregated data. FRPL is used because it's ubiquitous, but there is debate as to whether it actually reflects the level of poverty among students.
```{r, eval = FALSE}
# Option 1: Get data using {tabulapdf}
frpl_pdf <-
extract_tables(
"https://github.com/data-edu/data-science-in-education/blob/main/data/agg_data/fall_2018_meal_eligiblity_official.pdf?raw=true"
)
# Option 2: Get data from the book repository
# The code below assumes you have set up folders data and agg_data within your working directory
frpl_pdf <-
readRDS(here("data", "agg_data", "frpl_pdf.Rds"))
# Option 3: Get data using {dataedu}
frpl_pdf <-
dataedu::frpl_pdf
```
Similar to the Race/Ethnicity PDF, you transform the list into a tibble using the following steps:
1. Take the PDF list output and turn it into tibbles using `map(as_tibble())`.
2. Create a single data frame with `list_rbind()`.
3. Remove unnecessary rows from each page using `slice()`.
4. Finally, create column names that are easily understood using `set_names()`.
```{r, eval=FALSE}
frpl_df <-
frpl_pdf %>%
# Turn each page into a tibble
map(~ as_tibble(.x, .name_repair = "unique")) %>%
# Make data frame
list_rbind() %>%
# Remove unnecessary rows
slice(-1) %>%
# Use descriptive column names
set_names(
c(
"school_name",
"not_eligible_num",
# Number of non-eligible students,
"reduce_num",
# Number of students receiving reduced price lunch
"free_num",
# Number of students receiving free lunch
"frpl_num",
# Total number of students (from the FRPL PDF)
"frpl_pct" # Free/reduced price lunch percentage
)
)
```
To clean the dataset further:
1. Remove the blank rows.
2. Notice that there are aggregations inserted into the table that are not at the district level. For example, the report includes `ELM K_08` as a value of the variable `school_name`, presumably to aggregate FRPL numbers up to the K--8 levels. Although this is useful data, you won't need it for this district-level analysis. Filter them out by inserting `!` before the variable name.
3. Turn the data in the `percentage` columns into a numeric data type using `mutate()`.
```{r, eval = FALSE}
frpl_df2 <-
frpl_df %>%
filter(
# Remove blanks
school_name != "",
# Filter out the rows in this list!school_name %in% c(
"ELM K_08",
"Mid Schl",
"High Schl",
"Alt HS",
"Spec Ed Total",
"Cont Alt Total",
"Hospital Sites Total",
"Dist Total"
)
) %>%
# Turn percentage columns into numeric and decimal format
mutate(frpl_pct = as.numeric(str_replace(frpl_pct, "%", "")) / 100)
```
Once the Race/Ethnicity and FRPL data are ready, you’ll join the two tibbles to prepare them for analysis.
1. Join the two tibbles by the variable `school_name`.
2. Use `mutate()` and `across()` to convert the 2nd to 17th columns into numeric data types.
```{r, eval = FALSE}
joined_df <-
# Create full dataset, joined by school name
left_join(race_df2, frpl_df2, by = c("school_name")) %>%
# Convert these columns to numeric
mutate(across(2:17, as.numeric))
```
Did you notice? The total number of students from the Race/Ethnicity table does not match the total number of students from the FRPL table, even though they're referring to the same districts in the same year. Perhaps the two datasets were created using different rules when aggregating the dataset. Perhaps the counts were taken at different times of the year, and students moved around in the meantime. Even if you don't know the cause, you'll need to make strategic decisions about which data to consider the "truth" in this analysis.
Continue preparing the data by creating new columns based on the merged dataset using `mutate()`.
1. NCES defines high poverty as schools that are over 75% FRPL [@ncesfrpl]. Calculate the number of students in "high-poverty" schools in a race group. Do this by counting the number of students for a particular race under the variable `[racename]_povnum` when a school is over 75% FRPL.
2. Calculate the weighted average of students in each category. Start by using `adorn_totals()` in the {janitor} package, which sums columns for you.
3. Create the weighted average of the percentage of each race by dividing the number of students by race by the total number of students.
4. Compute the FRPL percentage for all schools by recalculating `frpl_pct`. Doing this gives you a weighted average.
5. Calculate the percentage of students by race who are in high-poverty schools by dividing the number of students in high-poverty schools by the total number of students in that race.
```{r, eval=FALSE}
district_merged_df <-
joined_df %>%
# Calculate high-poverty numbers
mutate(
hi_povnum = case_when(frpl_pct > .75 ~ hi_num),
aa_povnum = case_when(frpl_pct > .75 ~ aa_num),
wh_povnum = case_when(frpl_pct > .75 ~ wh_num),
as_povnum = case_when(frpl_pct > .75 ~ as_num),
na_povnum = case_when(frpl_pct > .75 ~ na_num)
) %>%
# Calculate totals of students
adorn_totals() %>%
# Create percentage by demographic
mutate(
na_pct = na_num / tot,
aa_pct = aa_num / tot,
as_pct = as_num / tot,
hi_pct = hi_num / tot,
wh_pct = wh_num / tot,
frpl_pct = (free_num + reduce_num) / frpl_num,
# Create percentage by demographic and poverty
hi_povsch = hi_povnum / hi_num[which(school_name == "Total")],
aa_povsch = aa_povnum / aa_num[which(school_name == "Total")],
as_povsch = as_povnum / as_num[which(school_name == "Total")],
wh_povsch = wh_povnum / wh_num[which(school_name == "Total")],
na_povsch = na_povnum / na_num[which(school_name == "Total")]
)
```
To facilitate the creation of plots later on, put this data in tidy format using `pivot_longer()`.
```{r, eval = FALSE}
district_tidy_df <-
district_merged_df %>%
pivot_longer(
cols = -matches("school_name"),
names_to = "category",
values_to = "value"
)
```
For your convenience, we've saved copies of the merged and tidy data in the book's GitHub repository and {dataedu}. To access them, you can run the code below.
```{r, message = FALSE}
# If reading in from book repository
district_tidy_df <-
read_csv(here("data", "agg_data", "district_tidy_df.csv"))
district_merged_df <-
read_csv(here("data", "agg_data", "district_merged_df.csv"))
# If using the {dataedu} package
district_tidy_df <- dataedu::district_tidy_df
district_merged_df <- dataedu::district_merged_df
```
## View data
### Discovering distributions
Now that you've imported the data, it's time to begin exploring it.
What do the racial demographics in this district look like? Use a barplot to visualize the different proportions of subgroups.
```{r fig9-1, fig.align = "center", results = "hide", message = FALSE, fig.cap = "Percentage of Population by Subgroup", fig.showtext = TRUE}
district_tidy_df %>%
# Filter for Total rows, since we want district-level information
filter(school_name == "Total",
str_detect(category, "pct"),
category != "frpl_pct") %>%
# Reordering x-axis so bars appear by descending value
ggplot(aes(x = reorder(category, -value), y = value)) +
geom_bar(stat = "identity", aes(fill = category)) +
labs(title = "Percentage of Population by Subgroup", x = "Subgroup", y = "Percentage of Population") +
# Make labels more readable
scale_x_discrete(
labels = c(
"aa_pct" = "Black",
"wh_pct" = "White",
"hi_pct" = "Hispanic",
"as_pct" = "Asian",
"na_pct" = "Native Am."
)
) +
# Makes labels present as percentages
scale_y_continuous(labels = scales::percent) +
scale_fill_dataedu() +
theme_dataedu() +
theme(legend.position = "none")
```
Look at these data and note the shares of each race group. Almost 40% of students are Black, and around 36% are White. These numbers match the percentages provided in the original PDFs, which validate the calculations you did when you cleaned the data. Good job!
Now, look at these data and note the students who are eligible for FRPL. Recall that `frpl_pct` is the percentage of the students in the district that are eligible for FRPL.
```{r}
district_tidy_df %>%
filter(category == "frpl_pct", school_name == "Total")
```
56.9% of the students are eligible for FRPL, compared to the US average of 52.1% in 2015-16 [@avgfrpl]. These numbers also validate what you saw in the original PDFs.
### Analyzing spread
Now that you've done some initial validations, try exploring the dataset through distributions. This further analysis empowers you to see beyond district totals and into the underlying school data.
Start by exploring student demographics across schools. Here is a histogram for the percentage of White students within the schools for which we have data.
```{r fig9-2, fig.align = "center", results = 'hide', message = FALSE, fig.cap = "Count of Schools by White Population", fig.showtext = TRUE}
district_merged_df %>%
# Remove district totals
filter(school_name != "Total") %>%
# X-axis will be the percentage of White students within schools
ggplot(aes(x = wh_pct)) +
geom_histogram(breaks = seq(0, 1, by = .1),
fill = dataedu_colors("darkblue")) +
labs(title = "Count of Schools by White Population",
x = "White Percentage",
y = "Count") +
scale_x_continuous(labels = scales::percent) +
theme(legend.position = "none") +
theme_dataedu()
```
26 of the 74 (35%) schools have between 0 and 10% White students. This number implies that although the school district may be diverse, the demographics are not evenly distributed across the schools. More than half of schools enroll fewer than 30% of White students, even though White students make up 35% of the district student population.
The school race demographics do not represent the district populations, but does that also hold for socioeconomic status?
## Analysis
### Creating categories
Try the approach you used to look at the distribution of student race groups in all schools to look at the distribution of student race groups in high-poverty schools.
High-poverty schools are defined as public schools where more than 75% of the students are eligible for FRPL. According to NCES, 24% of public school students attended high-poverty schools [@ncesfrpl]. However, different subgroups are overrepresented and underrepresented within high-poverty schools. Is this the case for this district?
```{r fig9-3, fig.align = "center", results = 'hide', message = FALSE, fig.cap = "Distribution of Subgroups in High Poverty Schools", fig.showtext = TRUE}
district_tidy_df %>%
filter(school_name == "Total", str_detect(category, "povsch")) %>%
ggplot(aes(x = reorder(category, -value), y = value)) +
geom_bar(stat = "identity", aes(fill = factor(category))) +
labs(title = "Distribution of Subgroups in High Poverty Schools",
x = "Subgroup",
y = "Percentage in High Poverty Schools") +
scale_x_discrete(
labels = c(
"aa_povsch" = "Black",
"wh_povsch" = "White",
"hi_povsch" = "Hispanic",
"as_povsch" = "Asian",
"na_povsch" = "Native Am."
)
) +
scale_y_continuous(labels = scales::percent) +
scale_fill_dataedu() +
theme_dataedu() +
theme(legend.position = "none")
```
8% of White students attend high-poverty schools, compared to 43% of Black students, 39% of Hispanic students, 28% of Asian students, and 45% of Native American students. In this district, non-White students disproportionally attend high-poverty schools.
### Reveal relationships
So far, you've validated your initial calculations and explored distributions. Next, you'll examine correlations.
Begin by investigating the correlation between race and FRPL percentage by school.
```{r fig9-4, warning = FALSE, fig.align = "center", message = FALSE, results = 'hide', fig.cap="FRPL Percentage vs. White Percentage", fig.showtext = TRUE}
district_merged_df %>%
filter(school_name != "Total") %>%
ggplot(aes(x = wh_pct, y = frpl_pct)) +
geom_point(color = dataedu_colors("green")) +
labs(title = "FRPL Percentage vs. White Percentage",
x = "White Percentage",
y = "FRPL Percentage") +
scale_y_continuous(labels = scales::percent) +
scale_x_continuous(labels = scales::percent) +
theme_dataedu() +
theme(legend.position = "none")
```
Similar to the result in Creating Categories, a strong negative correlation exists between FRPL percentage and the percentage of White students in a school. That is, high-poverty schools appear to have a lower percentage of White students, and low-poverty schools have a higher percentage of White students.
## Results
Because of the disaggregated data this district provides, you can go deeper than the average demographics across the district and analyze at the school level. These distinct but closely related views demonstrate that:
1. The distribution of race/ethnicity within schools does not represent the district.
2. Students of color are overrepresented in high-poverty schools.
3. There is a negative relationship between the percentage of White students in a school and the percentage of students eligible for FRPL.
## Conclusion
This analysis, like all analyses, does not occur in a vacuum. According to the Urban Institute, the disproportionate percentage of students of color attending high-poverty schools "is a defining feature of almost all Midwestern and northeastern metropolitan school systems" [@urbanpov]. Among other issues, "high poverty schools tend to lack the educational resources - like highly qualified and experienced teachers, low student-teacher ratios, college prerequisite and advanced placement courses, and extracurricular activities - available in low-poverty schools". This disproportionality has a huge impact on these students and their futures.
In addition, research shows that racial and socioeconomic diversity in schools can provide students with a range of cognitive and social benefits. Therefore, the deep segregation that exists in the district can have adverse effects on students.
As data scientists in education, we can use these data to identify the inequity in a system and suggest interventions to improve the situation. In addition, we can advocate for more datasets such as these, which allow us to analyze outcomes in total and across subgroups. In the next chapter, we discuss aggregate data further, focusing on how to understand changes over time.