-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-rbasics.Rmd
502 lines (368 loc) · 13 KB
/
02-rbasics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
# Basics
```{r, include=FALSE}
source("_common.R", local = knitr::knit_global())
```
## Prerequisites
We introduce the basics in R programming in this chapter. We will review the
basic operators and data types in this chapter. We also provide an introduction
to the basic R data structures (including tibbles and data tables).
Most of this chapter involves working with R basic operators and data types,
which do not require any extra packages. We will also introduce the `{tibble}`
package, which forms part of the `{tidyverse}` package in section
\@ref(tibbles), and the `{data.table}` package in section \@ref(data-table)
```{r, message=FALSE}
library(tidyverse)
library(data.table)
```
## Basic operators
Basic arithmetic operators (`+`, `-`, `*`, `/`, `^`, \`%') would work like your
calculator
```{r}
3 + 2 # addition
3 - 2 # subtraction
3 * 2 # multiplication
3 / 2 # division
3^2 # exponent
3 %/% 2 # integer division
3 %% 2 # mod (remainder of a division)
```
R uses the `<-` operator for assignments. You can read the following code as
assigning the outcome of `2 + 3`, `5`, and `3` to the object
`value_a`,`value_b`, and `value_c` respectively, which stores it for later use.
```{r}
value_a <- 2 + 3
value_b <- 5
value_c <- 3
```
You can `print` what is stored in the object to console with
```{r}
print(value_a)
```
You can use relational operators to compare how one object relates to another.
```{r}
2 + 3 == 5 # TRUE that 2 + 3 equals 5
2 + 3 != 5 # FALSE that 2 + 3 not equals to 5
2 + 3 < 3 # FASLE that 2 + 3 is less than 3
2 + 3 > 3 # TRUE that 2 + 3 is more than 3
2 + 3 <= 5 # TRUE that 2 + 3 is less than or equal to 5
2 + 3 >= 5 # TRUE that 2 + 3 is more than or equal to 5
```
You can use logical operators to connect two or more expressions. For example,
to connect the results of the comparisons made using relational operators.
```{r}
(2 + 3 == 5) && (2 + 3 < 3) # logical AND operator
(2 + 3 == 5) || (2 + 3 >= 3) # logical OR operator
```
Note that the logical `&&` and `||` only examines the first element of a vector.
```{r}
x <- c(TRUE, TRUE, FALSE)
y <- c(FALSE, TRUE, FALSE)
x && y
x || y
```
To perform element-wise logical operations, use `&` and `|` instead
```{r}
x & y
x | y
!y
```
## Basic data types
There are basic data types (also known as atomic data types) in R in order to
use them.
| Data Type | Examples | Additional Information |
|:----------|:-------------------------|:----------------------------------------------------------|
| Logical | `TRUE`, `FALSE` | Boolean values |
| Numeric | `1`, `999.9` | Default data type for numbers |
| Integer | `1L`, `999L` | `L` is used to denote an integer |
| Character | `"a"`, `"R for BES"` | Data type for one or more characters |
| Complex | `2 + 3i` | Data type for numbers with a real and imaginary component |
| Raw | `charToRaw("R for BES")` | Not commonly used data type used to store raw bytes |
## Basic data structures
The basic data structures in R include factors, atomic vectors, lists, matrices,
and `data.frame`s
Factors are used in R to represent categorical variables. Although they appear
similar to character vectors they are actually stored as integers. You can use
the function `levels()` to output the categorical variables and `nlevels()` to
check the number of categorical variables.
```{r create-factors}
eye_color <- factor(c("brown", "black", "green", "brown", "black", "blue"))
nlevels(eye_color)
levels(eye_color)
```
Atomic vectors or more frequently referred to as vectors are a data structure
that is used to store multiple objects of the same data type (logical, numeric,
integer, character, complex, or raw). Vectors are one-indexed (i.e., the first
element is indexed using `[1]`) and you can get the number of elements in the
vector using the function `length()`. The function `class()` can be used to
reveal the class of any object in R.
```{r create-vectors}
vec_num <- c(1, 2, 3, 4)
class(vec_num)
vec_char <- c("R", "for", "BES")
class(vec_char)
# coercion if data types are mixed
vec_mix <- c("R", 4, "BES")
class(vec_mix)
# you can easily combine vectors using the c function
c(vec_char, vec_mix)
# vector length
length(vec_num)
# access first element of vector
vec_num[1]
```
Lists are an ordered data structure that is used to store multiple R objects of
different types. The function `list()` is used to create a list and a list in R
can be accessed using a single `[]` or double brackets `[[]]`. Using `[]`
returns a list of the selected element while using `[[]]` returns the selected
element. Using the function `length()`, you can obtain the number of objects in
a list.
```{r}
my_list <- list(
c(1, 2, 3, 4),
c("a", "b"),
1L,
matrix(1:9, ncol = 3)
)
my_list_a <- my_list[1]
class(my_list_a)
my_list_b <- my_list[[1]]
class(my_list_b)
length(my_list)
```
If you have named the elements in your list, you could also access them by
specifying their names in the brackets or using the `$` operator.
```{r}
named_list <- list(
a = c(1, 2, 3, 4),
b = c("a", "b"),
c = 1L,
d = matrix(1:9, ncol = 3)
)
class(named_list["a"])
class(named_list[["a"]])
class(named_list$a)
```
A matrix is a two dimensional data structure that is used to store multiple
objects. You can use the function `matrix()` to create a matrix using the `ncol`
and `nrow` argument to specify the number of columns and rows respectively, and
the `byrow` argument to specify how the data in would be ordered.
```{r}
matrix(1:12, ncol = 3, byrow = FALSE)
matrix(1:12, nrow = 3, byrow = FALSE)
matrix(1:12, ncol = 3, byrow = TRUE)
```
Aside from numeric data types, a matrix can also be used to store other data
types as long as they are homogeneous. To store heterogeneous data types, you
should use a data frame which is introduced next.
```{r}
matrix(c("brown", "black", "green", "brown", "black", "blue"), ncol = 2)
matrix(c("TRUE", "TRUE", "FALSE", "FALSE"), ncol = 2)
matrix(c(1L, 2L, 3L, 4L), ncol = 2)
```
You can access the values in a matrix by providing the row and column index. For
example, `[1, 2]` would return the value stored in the first row, second column
of the matrix. `[3, 1]` would return the value stored in the third row, first of
the matrix. You can retrieve all the values in a column by leaving the row index
empty and likewise retrieve all the values in a row by leaving the column index
empty. For example, `[, 2]` would return all the values in column 2 and `[2, ]`
would return all the values in row 2.
```{r}
m <- matrix(1:12, nrow = 3, byrow = FALSE)
m[1, 2]
m[3, 1]
m[, 2]
m[2, ]
```
A `data.frame` is a two-dimensional data structures that are used to store
heterogeneous data types in R. As a result of it's convenience, data frames are
a commonly used data structure in R. You can use the function `data.frame()` to
create a data frame.
```{r}
df <- data.frame(
x = c(1, 2, 3),
y = c("red", "green", "blue"),
z = c(TRUE, FALSE, TRUE)
)
```
You can access elements of a data.frame like a list `[]`, `[[]]` or `$`. Using
`[]` returns a `data.frame` of the selected element while using `[[]]` or `$`
will reduce it to a vector.
```{r}
df
df["y"]
df[["y"]]
df$y
```
You can also access a `data.frame` like a matrix.
```{r}
df
df[1, 2]
df[, 2]
df[2, ]
```
## Tibbles {#tibbles}
Tibbles are basically a modified version of R's `data.frame`. Therefore, you
would also access tibbles like how you would access a `data.frame`. You can
create a tibble using the function `tibble()`. Alternatively, you can coerce a
data frame into a tibble using `as_tibble()`.
```{r}
tibble(
x = c(1, 2, 3),
y = c("red", "green", "blue"),
z = c(TRUE, FALSE, TRUE)
)
as_tibble(iris)
```
A key difference lies in how tibbles are printed. Printing a tibble only results
in the first ten rows being displayed with an explicit reporting of each
column's data type.
```{r}
tb <- as_tibble(iris)
print(tb)
df <- as.data.frame(iris)
print(df)
```
Unlike a `data.frame`, tibbles provides clarity on the data structure that it
returns. When indexing with tibbles, `[` always return another tibble while `[[`
and `$` alway returns a vector. In contrast, single column data frames are often
converted into atomic vectors in R unless `drop = FALSE` is specified.
```{r}
class(tb[, 1])
class(tb[[1]])
class(tb$Sepal.Length)
class(df[, 1])
class(df[, 1, drop = FALSE])
```
Additionally, tibbles do not do partial matching and raises a warning unless the
variable specified is an exact match.
```{r}
tb$Sepal.Lengt
df$Sepal.Lengt
```
You can read more about tibbles by typing `vignette("tibble")` in your console.
## data.table {#data-table}
Like tibbles, `data.table`s are an enhanced version of `data.frame`s. You can
create a `data.table` using the function `data.table()`. You can also coerce
existing R objects into a `data.table` with `setDT()` for `data.frame`s and
lists, and `as.data.table()` for other data structures. Note that
`as.data.table()` also works with `data.frame`s and lists. However `setDT()` is
more memory efficient because it does not create a copy of the original data
frame or list but instead returns a data table by reference.
```{r}
dt <- data.table(
x = c(1, 2, 3),
y = c("red", "green", "blue"),
z = c(TRUE, FALSE, TRUE)
)
class(
setDT(
data.frame(c(1, 2, 3))
)
)
```
`data.table`s provide additional functionality through the way it is queried.
The general form for working with a data table is `[i, j, by]`, which can be
read as subset rows using `i`, operate on `j`, and grouped by `by`.
Lets see how this work using the `iris` example dataset.
```{r}
dt <- as.data.table(iris)
print(dt)
```
If you want to get an explicit reporting of each column's data type, as the
`tibble` does by default, you can set the `data.table.print.class` to `TRUE`.
```{r}
options(datatable.print.class = TRUE)
print(dt)
```
You can filter the rows to only contain `Species == "virginica`.
```{r}
dt[Species == "virginica"]
```
You can select the columns using the `j` expression. Notice that wrapping the
variables within `list()` or `.()` ensures that a `data.table` is returned. In
contrast, an atomic vector is returned when `list()` or `.()` is not used. `.()`
is an alias for `list()` and therefore the two are the same.
```{r}
dt[, Sepal.Length]
class(dt[, Sepal.Length])
class(dt[, list(Sepal.Length)])
class(dt[, .(Sepal.Length)])
```
You can select multiple columns with `list()` or `.()`.
```{r}
dt[, .(Sepal.Length, Species)]
```
You can also save the targeted column names in a variable and use it to specify
columns with `..` prefix.
```{r}
cols <- c("Sepal.Length", "Species")
dt[, ..cols]
```
Aside from selecting columns using `j`, you can carry out computations on `j`
involving one or more columns and a subset of rows using `i`.
```{r}
dt[, mean(Sepal.Length)]
dt[, .(
Sepal.Length.Mean = mean(Sepal.Length),
Sepal.With.Mean = mean(Sepal.Width)
)]
dt[
Species == "virginica" & Sepal.Length < 6,
.(
Sepal.Length.Mean = mean(Sepal.Length),
Sepal.With.Mean = mean(Sepal.Width)
)
]
```
You can then use the `by` expression in data tables to perform computations by
groups.
```{r}
dt[, .(
Sepal.Length.Mean = mean(Sepal.Length),
Sepal.With.Mean = mean(Sepal.Width)
),
by = Species
]
```
The `.N` variable that counts the number of instances is particularly useful
when combined with `by`.
```{r}
dt[, .N, by = Species]
```
You can also apply it to multiple columns using the `list()` or `.()` notation.
You can read the code below as calculating the mean of `Speal.Length` and the
number of instances (given by `.N`) grouped by their `Species` and whether
`Sepal.Length < 6`.
```{r}
dt[, .(Sepal.Length.Mean = mean(Sepal.Length), .N),
by = .(Species, Sepal.Length < 6)
]
```
`data.table`s add, update, and delete columns by reference to avoid redundant
copies for performance improvements. You can use the `:=` operator to add,
update, and delete columns in `j` by reference. There are two forms for using
`:=` and they are: `[, LSH := RHS]` and `[,`:=`(LHS = RHS)]`.
```{r}
dt <- as.data.table(iris)
dt[, Sepal.Sum := .(Sepal.Length + Sepal.Width)]
head(dt)
dt[, `:=`(Petal.Sum = Petal.Length + Petal.Width)]
head(dt)
```
Note that in the above code, we do not need to make any assignments back to a
variable because the modification is done by reference or in place. In other
words we are modifying `dt` and not a copy of `dt`. Therefore, you will also see
that if you run the entire code chunk above, `dt` will contain both columns
`Sepal.Sum` and `Petal.Sum`.
Since `:=` is used in `j`, it can be combined with `i` and `by` as we have seen
in the earlier parts of this sub-section.
```{r}
dt[Species == "versicolor" | Species == "virginica", Sepal.Length := 0]
head(dt)
dt[, Sepal.Length.Mean := mean(Sepal.Length),
by = .(Species)
]
head(dt)
```
You can find out more about `data.table`s by typing
`vignette(package = "data.table")` into the console.