forked from briatte/ida
-
Notifications
You must be signed in to change notification settings - Fork 0
/
042_reshaping.Rmd
176 lines (142 loc) · 8.57 KB
/
042_reshaping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 4.2. Reshapes and aggregates
[dsn-bubble]: http://www.decisionsciencenews.com/2011/03/17/the-housing-bubble-by-city/
[sap-csi]: http://www.standardandpoors.com/indices/sp-case-shiller-home-price-indices/en/us/
[gs-aggr]: https://gastonsanchez.wordpress.com/2012/06/28/5-ways-to-do-some-calculation-by-groups/
Datasets do not always come in the exact form in which we want to analyze them: there is often some manipulation to do. Most people do this by hand, which is error-prone and time-consuming. This section covers a few ways to reshape a dataset. Reshaping is a set of operations to handle aggregates in your data, as when you want to compute the average income of each age and gender group.
Both examples below are based on the `ggplot2` and `reshape` packages.
```{r packages, message = FALSE, warning = FALSE}
# Load packages.
packages <- c("downloader", "ggplot2", "plyr", "reshape", "scales")
packages <- lapply(packages, FUN = function(x) {
if(!require(x, character.only = TRUE)) {
install.packages(x)
library(x, character.only = TRUE)
}
})
```
## Visualizing the U.S. housing market by city
Here's an [example based on the recent housing bubble][dsn-bubble]. It shows changes in the [Case Schiller Index][sap-csi] a price index for the housing market, over the recent years.
```{r csi-data, eval = FALSE, include = FALSE}
dat <- read.csv("data/CSHomePrice_History.csv")
write.csv(dat, "data/schiller.8712.csv", row.names = FALSE)
```
```{r csi-load, results='hide'}
# Open the data.
csi <- read.csv("data/schiller.8712.csv")
# Inspect the top data structure.
str(csi[, 1:5])
# Inspect the first data rows/columns.
head(csi[, 1:5])
```
You can see that the data uses U.S. states as columns. Instead of that "wide" format, we prefer to use "long" data where all states are held in a single column. To get this arrangement in more rows and less columns, we "collapse" the data: all column variables are put into a single one, except for the one identifier variable, year, which is repeated over rows.
```{r csi-melt, results='hide'}
# Collapse the data by years.
csi.melted <- melt(csi, id = "YEAR")
# Name the columns.
names(csi.melted) <- c("MonthYear", "City", "IndexValue")
```
The `MonthYear` variable is given a `01-` prefix to make it the first day of each month, and that result is converted into a date, which is required for optimal plotting. This process uses time code format for day-month-year, `%d-%b-%y`, where `%d` is the day, `%b` the abbreviated month and `%y` the year without century. This process is explored again when we get to time series in [Session 9][090].
[090]: 090_ts.html
```{r csi-fix-date}
# Convert dates.
csi.melted$Date <- as.Date(paste0("01-", csi.melted$MonthYear), "%d-%b-%y")
```
Finally, here's the first plot, using a different color for each city. There's a lot of data but the general trend is clearly understandable. You will get a warning due to missing values, but the plot should still process and look as below.
```{r csi-plot-all-auto, fig.width = 8, fig.height = 6, tidy = FALSE, warning = FALSE}
# Build line plot.
g1 <- ggplot(data = csi.melted, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g1
```
Let's go further and subset the data to a few urban or large states and plot the data again. Subsetting is a very common operation, so make sure that you learn about the `subset()` function to subset data frames. The argument that we pass here uses the `%in%` selector, which returns TRUE for each value of the `City` variable in the `csi` object that is present in the `cities` list.
```{r csi-plot-selection-auto, fig.width = 8, fig.height = 6, tidy = FALSE, warning = FALSE}
# Select only a handful of states.
cities = c('NY.New.York', 'FL.Miami', 'CA.Los Angeles', 'MI.Detroit',
'TX.Dallas', 'IL.Chicago', 'DC.Washington')
# Create a subset of the data.
csi.subset = subset(csi.melted, City %in% cities)
# Build plot.
g2 <- ggplot(data = csi.subset, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g2
```
## Visualizing U.S. homicide trends by weapon type
In this segment, we show how to draw smooth trends of assault deaths in the United States. The data are from the [Bureau of Justice Statistics][bjs] and were mentioned on the Reddit [/r/datasets/][reddit-datasets] channel in early 2013.
[bjs]: http://bjs.ojp.usdoj.gov/
[reddit-datasets]: http://www.reddit.com/r/datasets
The first code block below will download the data and save it for you if you do not already have the data at hand: it looks for the `htus8008` ZIP archive in your `data` folder (`htus8008` stands for "Homicide Trends in the United States, 1980-2008"), and downloads the original data source if necessary. It then looks for a selected file inside the `htus8008` folder.
```{r bjs-zip}
# Identify ZIP folder.
zip = "data/htus8008.zip"
# Download ZIP archive.
if(!file.exists(zip)) {
# Target data source.
url = "http://bjs.ojp.usdoj.gov/content/pub/sheets/htus8008.zip"
# Download ZIP archive.
download(url, zip, mode = "wb")
}
# Inspect ZIP contents.
head(unzip(zip, list = TRUE))
# Read CSV file.
bjs <- read.csv(unz(zip, "htus8008f42.csv"), skip = 11, nrows = 29)
```
The second code block imports one of the CSV files by reading only the valid data lines from it. See the `README` file of the BJS data folder for a list of all series, and open a few files to see how they are structured. The code block also does a bit of data cleaning: it removes the last (empty) column and replaces the dots in the names of some columns with spaces. It ends on routine data checks.
```{r bjs-data}
# Inspect the data.
str(bjs)
# Remove last column.
bjs <- bjs[, -7]
# Clean names.
names(bjs) <- gsub("\\.", " ", names(bjs))
# Check first rows.
head(bjs)
# Check final rows.
tail(bjs)
```
Our last step is to prepare the data by reshaping it to long format with the `melt()` function. The final result has one unit of analysis (years), one category of observations (weapon type), and one column of values (homicide counts). We rename the columns to proper variable names and reorder the weapon types, which are factors, by the average homicide count.
```{r bjs-prepare, message=FALSE}
# Reshape to long format.
bjs <- melt(bjs, id = "Year")
# Check result.
head(bjs)
# Rename variables.
names(bjs) <- c("Year", "Weapon", "Count")
# Inspect weapon types.
levels(bjs$Weapon)
# Order weapon type by homicide count.
bjs$Weapon <- with(bjs, reorder(Weapon, -Count, mean))
```
The plot is now easy to set up with `ggplot2`: we set the canvas to represent the homicide counts on the vertical y-axis and the year on the horizontal x-axis, and then plot one colored line per weapon category. There's all sort of tweaks that might apply at that stage of the plot. A simple one used here is to format the vertical y-scale, to show commas every 1,000 units.
```{r bjs-plot-auto, fig.width = 8, fig.height = 6, tidy = FALSE}
# Plot canvas.
qplot(data = bjs, y = Count, x = Year, color = Weapon,
geom = c("line", "point")) +
labs(y = "Homicide count", x = NULL) +
scale_y_continuous(labels = comma)
```
To extract whatever statistic we need from the `bjs` dataset, we can apply a `mean` or `summary` function to the data, splitted by weapon type. These methods were first mentioned when we described vectorization, and are shown again when we cover [descriptive statistics](061_description.html). Four methods to [calculate by groups][gs-aggr] are shown below.
```{r bjs-tapply, tidy = FALSE}
# Average homicide count by weapon type, using with() and tapply().
with(bjs, tapply(Count, Weapon, mean))
# Similar syntax with by() to get quintiles of homicide counts by weapon type.
by(bjs$Count, bjs$Weapon, quantile)
```
The `tapply()` and `by()` functions shown above resemble each other very much, while the next two syntaxes are more specific. The `aggregate()` function uses formula notation of the form `y ~ x`, which can also be used to write models and to facet plots in `ggplot2` syntax. The `ddply()` function follows the syntax of the `plyr` package, which handles advanced data transformation routines.
```{r bjs-aggregate}
# aggregate()'s formula notation of the form variable ~ group.
aggregate(Count ~ Weapon, bjs, summary)
# ddply()'s more demanding syntax offers more functionality.
ddply(bjs, .(Weapon), summarise,
N = length(Count),
Mean = mean(Count),
SD = sd(Count),
Min = min(Count),
Max = max(Count))
```
> __Next__: [Practice](043_practice.html).