---
title: "How to explore correlations?"
output:
  html_document:
    df_print: paged
---
[Source](https://drsimonj.svbtle.com/exploring-correlations-in-r-with-corrr)
```{r}
d <- mtcars
d$hp[3] <- NA
head(d)
```
We could be motivated by multicollinearity:
```{r}
fit_1 <- lm(mpg ~ hp, data = d)
fit_2 <- lm(mpg ~ hp + disp, data = d)
```
```{r}
summary(fit_1)
```
```{r}
summary(fit_2)
```
Strange result. Let’s check the correlations between `mpg`, `hp`, and `disp` to try and diagnose this problem. It should be simple using the base R function, `cor()`. Right?
Err, what is with all the `NA`s?
```{r}
rs <- cor(d)
rs
```
Have to handle missing values with `use`:
```{r}
rs <- cor(d, use = "pairwise.complete.obs")
rs
```
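The choice of `use` matters because the two common options drop different rows. As a small sketch (the objects `d_demo`, `rs_complete`, and `rs_pairwise` are names introduced here for illustration), `"complete.obs"` drops every row that has any `NA`, while `"pairwise.complete.obs"` drops a row only for the pairs that involve the missing variable:

```{r}
# Illustrative copy of the data with one missing hp value, as above
d_demo <- mtcars
d_demo$hp[3] <- NA

# Listwise deletion: row 3 is dropped for every pair of variables
rs_complete <- cor(d_demo, use = "complete.obs")

# Pairwise deletion: row 3 is dropped only for pairs involving hp
rs_pairwise <- cor(d_demo, use = "pairwise.complete.obs")

# mpg and disp have no missing values, so the pairwise estimate
# should match the correlation computed on the full data
all.equal(rs_pairwise["mpg", "disp"], cor(mtcars$mpg, mtcars$disp))

# The listwise estimate for the same pair is computed without row 3
all.equal(rs_complete["mpg", "disp"], cor(mtcars$mpg[-3], mtcars$disp[-3]))
```

With a single missing value the two matrices are nearly identical, but with many scattered `NA`s the listwise version can discard a large share of the data.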
Can we focus on a subset with dplyr? Nope.
```{r}
# Left commented out because it errors: select() has no method for a matrix
#dplyr::select(rs, mpg, hp, disp)
```
Riiiiiight! It’s a matrix and dplyr is for data frames.
```{r}
class(rs)
```
So we can use square brackets with matrices? Or not…
```{r}
vars <- c("mpg", "hp", "disp")
rs[rownames(rs) %in% vars]
```
Mm, square brackets can do different things with matrices. Without a comma, the matrix is treated like a vector. With a comma, we can specify the rows and columns separately.
```{r}
vars <- c("mpg", "hp", "disp")
rs[rownames(rs) %in% vars, colnames(rs) %in% vars]
```
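Because the matrix has row and column names, the same subset can also be sketched by indexing with the names directly. One difference to keep in mind: name indexing returns rows and columns in the order given in `vars`, whereas the `%in%` version keeps the matrix's own order. The objects `d_demo`, `rs_demo`, `by_name`, and `by_logical` are illustrative names, not part of the original walkthrough.

```{r}
# Rebuild the correlation matrix so this chunk stands alone
d_demo <- mtcars
d_demo$hp[3] <- NA
rs_demo <- cor(d_demo, use = "pairwise.complete.obs")

vars <- c("mpg", "hp", "disp")

# Index rows and columns by name; the order follows `vars`
by_name <- rs_demo[vars, vars]

# Logical indexing keeps the order in which the columns appear in the data
by_logical <- rs_demo[rownames(rs_demo) %in% vars, colnames(rs_demo) %in% vars]

rownames(by_name)
rownames(by_logical)
```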
We diagnosed our multicollinearity problem. What if we want to do something a bit more complex, like exploring how variables cluster in high-dimensional space? We could use exploratory factor analysis.
```{r}
factanal(na.omit(d), factors = 2)
```
```{r}
factanal(na.omit(d), factors = 5)
```
So many questions! I’d much rather explore the correlations.
Let’s try to find all variables with a correlation greater than 0.90. Why doesn’t this work?!
```{r}
col_has_over_90 <- apply(rs, 2, function(x) any(x > .9))
rs[, col_has_over_90]
```
The diagonal is all 1s, so every column has a value greater than .90!
Exclude the diagonal:
```{r}
diag(rs) <- NA
col_has_over_90 <- apply(rs, 2, function(x) any(x > .9, na.rm = TRUE))
rs[, col_has_over_90]
```
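Another way to sidestep the diagonal, sketched here in base R, is to look only at the upper triangle and list the qualifying pairs by name, so each pair shows up once. The objects `rs_demo`, `over_90`, `idx`, and `pairs_over_90` are illustrative names.

```{r}
# Rebuild the correlation matrix so this chunk stands alone
d_demo <- mtcars
d_demo$hp[3] <- NA
rs_demo <- cor(d_demo, use = "pairwise.complete.obs")

# TRUE only for cells above the diagonal whose correlation exceeds .9
over_90 <- rs_demo > .9 & upper.tri(rs_demo)

# Row and column positions of the qualifying cells
idx <- which(over_90, arr.ind = TRUE)

# One row per qualifying pair, with the variable names and the correlation
pairs_over_90 <- data.frame(
  x = rownames(rs_demo)[idx[, "row"]],
  y = colnames(rs_demo)[idx[, "col"]],
  r = rs_demo[idx]
)
pairs_over_90
```

Like the chunks above, this only catches positive correlations; comparing `abs(rs_demo)` instead would also pick up strong negative ones.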
## Exploring data with the tidyverse
```{r}
library(tidyverse)
d %>%
  select(mpg:drat) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~key, scales = "free")
```
## Using `corrr`
```{r}
library(corrr)
d %>%
  correlate() %>%
  focus(mpg:drat, mirror = TRUE) %>%
  network_plot()
```
```{r}
rs <- correlate(d)
rs
```
```{r}
rs %>%
  select(mpg:drat) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~key)
```
How about that challenge to find cols with a correlation greater than .9?
```{r}
any_over_90 <- function(x) any(x > .9, na.rm = TRUE)
rs %>% select_if(any_over_90)
```
```{r}
rs %>%
  focus(mpg, disp, hp)
```
```{r}
rs %>%
  focus(-mpg, -disp, -hp)
```
```{r}
rs %>%
  focus(mpg, disp, hp, mirror = TRUE)
```
```{r}
rs %>%
  focus(matches("^d"))
```
```{r}
rs %>%
  focus(mpg)
```
```{r}
# Recent corrr versions name the first column `term`;
# older versions called it `rowname`.
rs %>%
  focus(mpg) %>%
  mutate(term = reorder(term, mpg)) %>%
  ggplot(aes(term, mpg)) +
  geom_col() +
  coord_flip()
```
```{r}
rs %>% rearrange()
```
```{r}
rs %>% shave()
```
```{r}
rs %>% stretch()
```
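For intuition about what the long format looks like, here is a rough base R sketch of the same kind of reshaping on a plain correlation matrix. This is not how `corrr` implements `stretch()`; the column names `x`, `y`, and `r` are chosen to mirror its output, and `rs_mat` and `long` are illustrative names.

```{r}
# A small plain correlation matrix to reshape
rs_mat <- cor(mtcars[, c("mpg", "hp", "disp")])

# as.table() followed by as.data.frame() gives one row per matrix cell
long <- as.data.frame(as.table(rs_mat), stringsAsFactors = FALSE)
names(long) <- c("x", "y", "r")

# Drop the self-correlations on the diagonal
long <- long[long$x != long$y, ]
long
```

Each pair still appears twice here (once in each direction), which is why combining `shave()` with `stretch()` is handy when only unique pairs are wanted.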
```{r}
rs %>%
  shave() %>%
  stretch(na.rm = FALSE) %>%
  ggplot(aes(r)) +
  geom_histogram()
```
```{r}
# Recent corrr versions name the first column `term`;
# older versions called it `rowname`.
rs %>%
  focus(mpg:drat, mirror = TRUE) %>%
  rearrange() %>%
  shave(upper = FALSE) %>%
  select(-hp) %>%
  filter(term != "drat")
```
```{r}
rs %>% fashion()
```
```{r}
# Recent corrr versions name the first column `term`;
# older versions called it `rowname`.
rs %>%
  focus(mpg:drat, mirror = TRUE) %>%
  rearrange() %>%
  shave(upper = FALSE) %>%
  select(-hp) %>%
  filter(term != "drat") %>%
  fashion()
```
```{r}
rs %>% rplot()
```
```{r}
rs %>%
  rearrange(method = "MDS", absolute = FALSE) %>%
  shave() %>%
  rplot(shape = 15, colors = c("red", "green"))
```
```{r}
rs %>% network_plot(min_cor = .6)
```