Bioinformatics for Big Omics Data: Advanced data manipulation

Raphael Gottardo
December 8, 2014

Setting up some options

Let's first turn on the cache for increased performance.

# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=80), cache=TRUE)

Motivation

  • R has pass-by-value semantics, which minimizes accidental side effects. However, this can become a major bottleneck when dealing with large datasets both in terms of memory and speed.

  • Working with data.frames in R can also be a painful process in terms of the code one has to write.

  • Fortunately, R provides some solutions to these problems.

Overview

Here we will review three R packages that can be used to provide efficient data manipulation:

  • data.table: A package for efficient data storage and manipulation
  • RSQLite: Database Interface R driver for SQLite
  • sqldf: An R package for running SQL statements on R data frames, optimized for convenience

Thanks to Kevin Ushey (@kevin_ushey) for the data.table notes and Matt Dowle and Arunkumar Srinivasan for helpful comments.

What is data.table?

data.table is an R package that extends R data.frames.

Under the hood, they are just data.frames, with some extra 'stuff' added on. So, they're collections of equal-length vectors. Each vector can be of a different type.

library(data.table)
dt <- data.table(x = 1:3, y = c(4, 5, 6), z = letters[1:3])
dt
##    x y z
## 1: 1 4 a
## 2: 2 5 b
## 3: 3 6 c
class(dt)
## [1] "data.table" "data.frame"

The extra functionality offered by data.table allows us to modify, reshape, and merge data.tables much quicker than data.frames. See that data.table inherits from data.frame!

Installing data.table

  • stable CRAN release
# Only install if not already installed.
if (!require("data.table")) install.packages("data.table")
  • latest bug-fixes + enhancements (No need to do that, we will use the stable release)
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)

What's different?

Most of your interactions with data.tables will be through the subset ([) operator, which behaves quite differently for data.tables. We'll examine a few of the common cases.

Visit this stackoverflow question for a summary of the differences between data.frames and data.tables.

Single element subsetting

library(data.table)
DF <- data.frame(x = 1:3, y = 4:6, z = 7:9)
DT <- data.table(x = 1:3, y = 4:6, z = 7:9)
DF[c(2, 3)]
##   y z
## 1 4 7
## 2 5 8
## 3 6 9
DT[c(2, 3)]
##    x y z
## 1: 2 5 8
## 2: 3 6 9

By default, single-element subsetting in data.tables refers to rows, rather than columns.

Row subsetting

library(data.table)
DF <- data.frame(x = 1:3, y = 4:6, z = 7:9)
DT <- data.table(x = 1:3, y = 4:6, z = 7:9)
DF[c(2, 3), ]
##   x y z
## 2 2 5 8
## 3 3 6 9
DT[c(2, 3), ]
##    x y z
## 1: 2 5 8
## 2: 3 6 9

Notice: row names are lost with data.tables. Otherwise, output is identical.

Column subsetting

library(data.table)
DF <- data.frame(x = 1:3, y = 4:6, z = 7:9)
DT <- data.table(x = 1:3, y = 4:6, z = 7:9)
DF[, c(2, 3)]
##   y z
## 1 4 7
## 2 5 8
## 3 6 9
DT[, c(2, 3)]
## [1] 2 3

DT[, c(2,3)] just returns c(2, 3). Why on earth is that?

The j expression

The subset operator is really a function, and data.table modifies it to behave differently.

We call the arguments we pass i and j, as in DT[i, j] or DT[i].

The second argument to [ is called the j expression, so-called because it's interpreted as an R expression. This is where most of the data.table magic happens.

j is an expression evaluated within the frame of the data.table, so it sees the column names of DT. Similarly for i.

First, let's remind ourselves what an R expression is.

Expressions

An expression is a collection of statements, enclosed in a block generated by braces {}.

## an expression with two statements
{
    x <- 1
    y <- 2
}
## the last statement in an expression is returned
k <- {
    print(10)
    5
}
## [1] 10
print(k)
## [1] 5

The j expression

So, data.table does something special with the expression that you pass as j, the second argument, to the subsetting ([) operator.

The return type of the final statement in our expression determines the type of operation that will be performed.

In general, the output should either be a list of symbols, or a statement using :=.

We'll start by looking at the list of symbols as an output.

An example

When we simply provide a list to the j expression, we generate a new data.table as output, with the operations performed within the list call.

library(data.table)
DT <- data.table(x = 1:5, y = 1:5)
DT[, list(mean_x = mean(x), sum_y = sum(y), sumsq = sum(x^2 + y^2))]
##    mean_x sum_y sumsq
## 1:      3    15   110

Notice how the symbols x and y are looked up within the data.table DT. No more writing DT$ everywhere!

Using :=

The := operator tells data.table to assign columns by reference into the data.table DT:

library(data.table)
DT <- data.table(x = 1:5)
DT[, `:=`(y, x^2)]
##    x  y
## 1: 1  1
## 2: 2  4
## 3: 3  9
## 4: 4 16
## 5: 5 25

Using :=

By default, data.tables are not copied on a direct assignment <-:

library(data.table)
DT <- data.table(x = 1)
DT2 <- DT
DT[, `:=`(y, 2)]
##    x y
## 1: 1 2
DT2
##    x y
## 1: 1 2

Notice that DT2 has changed. This is something to be mindful of; if you want to explicitly copy a data.table, do so with DT2 <- copy(DT).
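As a quick sketch of the difference, an explicit copy is unaffected by later := assignments (toy columns for illustration):

```r
library(data.table)

DT <- data.table(x = 1)
DT2 <- copy(DT)    # explicit deep copy
DT[, `:=`(y, 2)]   # assigns by reference into DT only
names(DT2)         # DT2 still has just the column "x"
## [1] "x"
```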

A slightly more complicated example

library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT[, `:=`(m, log2((x + 1)/(y + 1)))]
##    x  y  z          m
## 1: 1  6 11 -1.8073549
## 2: 2  7 12 -1.4150375
## 3: 3  8 13 -1.1699250
## 4: 4  9 14 -1.0000000
## 5: 5 10 15 -0.8744691

Using an expression in j

Note that the right-hand side of a := call can be an expression.

library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT[, `:=`(m, {
    tmp <- (x + 1)/(y + 1)
    log2(tmp)
})]
##    x  y  z          m
## 1: 1  6 11 -1.8073549
## 2: 2  7 12 -1.4150375
## 3: 3  8 13 -1.1699250
## 4: 4  9 14 -1.0000000
## 5: 5 10 15 -0.8744691

Multiple returns in j

The left-hand side of a := call can also be a character vector of names, in which case the final statement in the j expression should be a list of the same length.

library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT[, `:=`(c("m", "n"), {
    tmp <- (x + 1)/(y + 1)
    list(log2(tmp), log10(tmp))
})]
##    x  y  z          m          n
## 1: 1  6 11 -1.8073549 -0.5440680
## 2: 2  7 12 -1.4150375 -0.4259687
## 3: 3  8 13 -1.1699250 -0.3521825
## 4: 4  9 14 -1.0000000 -0.3010300
## 5: 5 10 15 -0.8744691 -0.2632414
DT[, `:=`(a = x^2, b = y^2)]
##    x  y  z          m          n  a   b
## 1: 1  6 11 -1.8073549 -0.5440680  1  36
## 2: 2  7 12 -1.4150375 -0.4259687  4  49
## 3: 3  8 13 -1.1699250 -0.3521825  9  64
## 4: 4  9 14 -1.0000000 -0.3010300 16  81
## 5: 5 10 15 -0.8744691 -0.2632414 25 100
DT[, `:=`(c("c", "d"), list(x^2, y^2))]
##    x  y  z          m          n  a   b  c   d
## 1: 1  6 11 -1.8073549 -0.5440680  1  36  1  36
## 2: 2  7 12 -1.4150375 -0.4259687  4  49  4  49
## 3: 3  8 13 -1.1699250 -0.3521825  9  64  9  64
## 4: 4  9 14 -1.0000000 -0.3010300 16  81 16  81
## 5: 5 10 15 -0.8744691 -0.2632414 25 100 25 100

The j expression revisited

So, we typically call j the j expression, but really, it's either:

  1. An expression, or

  2. A call to the function :=, for which the first argument is a set of names (vectors to update), and the second argument is an expression, with the final statement typically being a list of results to assign within the data.table.

As I said before, a := b is parsed by R as ":="(a, b), which is why it looks somewhat like an operator.

quote(`:=`(a, b))
## `:=`(a, b)

Why does it matter?

Whenever you sub-assign a data.frame, R is forced to copy the entire data.frame.

That is, whenever you write DF$x <- 1, DF["x"] <- 1, DF[["x"]] <- 1...

... R will make a copy of DF before assignment.

This is done in order to ensure any other symbols pointing at the same object do not get modified. This is a good thing for when we need to reason about the code we write, since, in general, we expect R to operate without side effects.

Unfortunately, this is prohibitively slow for large objects, which is why := can be very useful.

Why does it matter?

library(data.table)
library(microbenchmark)
big_df <- data.frame(x = rnorm(1e+06), y = sample(letters, 1e+06, TRUE))
big_dt <- data.table(big_df)
microbenchmark(big_df$z <- 1, big_dt[, `:=`(z, 1)])
## Unit: milliseconds
##                  expr      min       lq     mean   median      uq      max
##         big_df$z <- 1 1.325358 2.331905 5.860944 2.544618 4.37489 105.6464
##  big_dt[, `:=`(z, 1)] 2.536554 4.295727 8.677173 5.066026 7.62658 128.0590
##  neval cld
##    100   a
##    100   a

Once again, notice that := is actually a function, and z := 1 is parsed as ":="(z, 1).

Using by

We can also perform grouping-like operations through the use of the by argument:

library(data.table)
DT <- data.table(x = 1:5, y = 6:10, gp = c("a", "a", "a", "b", "b"))
DT[, `:=`(z, mean(x + y)), by = gp]
##    x  y gp  z
## 1: 1  6  a  9
## 2: 2  7  a  9
## 3: 3  8  a  9
## 4: 4  9  b 14
## 5: 5 10  b 14

Notice that since mean(x+y) returns a scalar (numeric vector of length 1), it is recycled to fill within each group.

Generating a new data.table

What if, rather than modifying the current data.table, we wanted to generate a new one?

library(data.table)
DT <- data.table(x = 1:5, y = 6:10, gp = c("a", "a", "a", "b", "b"))
DT[, list(z = mean(x + y)), by = gp]
##    gp  z
## 1:  a  9
## 2:  b 14

Notice that we retain one row for each unique group specified in the by argument, and only the by variables alongside our z variable are returned.

The j expression

  • A list

... returns a new data.table, potentially subset over groups in your by.

  • A := call

... modifies the data.table in place, hence saving memory. Output is recycled if the by argument is used.

In general, our j expression is either:

  1. an expression, with the final (or only) statement being a list of (named) arguments,

  2. a call to the := function, with

  • the first argument being names, and
  • the second argument being an expression, for which the last statement is a list of the same length as the first argument.

Special variables

There are a number of special variables defined only within j, that allow us to do some neat things...

library(data.table)
data.table()[, ls(all = TRUE)]
## [1] ".GRP" ".I"   ".N"   ".SD"

These variables allow us to infer a bit more about what's going on within the data.table calls, and also allow us to write more complicated j expressions.

Special variables

.SD

A data.table containing the subset of data for each group, excluding columns used in by.

.BY

A list containing a length 1 vector for each item in by.

.N

An integer, length 1, containing the number of rows in .SD.

Special variables

.I

A vector of indices, holding the row locations in the parent DT from which .SD was pulled.

.GRP

A counter telling you which group you're working with (1st, 2nd, 3rd...)
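A small sketch showing .GRP, .N, and .I together (toy data; the column names are illustrative):

```r
library(data.table)

DT <- data.table(x = 1:5, gp = c("a", "a", "b", "b", "b"))
# .GRP: which group we are in; .N: rows in the group; .I: row indices in DT
DT[, list(grp_number = .GRP, n = .N, first_row = .I[1]), by = gp]
##    gp grp_number n first_row
## 1:  a          1 2         1
## 2:  b          2 3         3
```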

Example usage of .N - Counts

Compute the counts, by group, using data.table...

set.seed(123)
library(data.table)
library(microbenchmark)
DT <- data.table(x = sample(letters[1:3], 1e+05, TRUE))
DT[, .N, by = x]
##    x     N
## 1: a 33387
## 2: c 33201
## 3: b 33412
table(DT$x)
## 
##     a     b     c 
## 33387 33412 33201

Example usage of .N - Counts

library(data.table)
library(microbenchmark)
DT <- data.table(x = factor(sample(letters[1:3], 1e+05, TRUE)))
microbenchmark(tbl = table(DT$x), DT = DT[, .N, by = x])
## Unit: milliseconds
##  expr      min        lq      mean    median       uq       max neval cld
##   tbl 8.535709 10.253819 12.447354 11.502042 13.22827 25.109841   100   b
##    DT 1.112893  1.403274  1.780877  1.590881  1.90645  4.514918   100  a

Example usage of .SD - lapply-type calls

library(data.table)
DT <- data.table(x = rnorm(10), y = rnorm(10), z = rnorm(10), id = letters[1:10])
DT[, lapply(.SD, mean), .SDcols = c("x", "y", "z")]
##             x         y         z
## 1: -0.2511676 0.7089166 0.4615515
lapply(DT[, 1:3, with = FALSE], mean)
## $x
## [1] -0.2511676
## 
## $y
## [1] 0.7089166
## 
## $z
## [1] 0.4615515

Example usage of .SD - lapply-type calls

library(data.table)
library(microbenchmark)
DT <- data.table(x = rnorm(1e+05), y = rnorm(1e+05), z = sample(letters, 1e+05, replace = TRUE))
DT2 <- copy(DT)
setkey(DT2, "z")
microbenchmark(DT = DT[, lapply(.SD, mean), .SDcols = c("x", "y"), by = "z"], DT2 = DT2[, 
    lapply(.SD, mean), .SDcols = c("x", "y"), by = "z"])
## Unit: milliseconds
##  expr      min       lq    mean   median       uq      max neval cld
##    DT 2.013897 2.485728 2.99011 2.684291 3.390388 5.935036   100   b
##   DT2 1.995405 2.235107 2.69206 2.397125 2.779748 9.285968   100  a

Setting a key can lead to faster grouping operations.

Keys

data.tables can be keyed, allowing for faster indexing and subsetting. Keys are also used for joins, as we'll see later.

library(data.table)
DT <- data.table(x = c("a", "a", "b", "c", "a"), y = rnorm(5))
setkey(DT, x)
DT["a"]  ## grabs rows corresponding to 'a'
##    x         y
## 1: a 1.3242525
## 2: a 3.2439986
## 3: a 0.6207716

Note that this does a binary search rather than a vector scan, which is much faster!

Key performance

library(data.table)
library(microbenchmark)
DF <- data.frame(key = sample(letters, 1e+06, TRUE), x = rnorm(1e+06))
DT <- data.table(DF)
setkey(DT, key)
identical(DT["a"]$x, DF[DF$key == "a", ]$x)
## [1] TRUE
microbenchmark(DT = DT["a"], DF = DF[DF$key == "a", ], times = 5)
## Unit: milliseconds
##  expr        min         lq      mean     median         uq        max
##    DT   3.009992   3.641566   3.88202   3.762855   3.949109   5.046576
##    DF 160.408577 189.953657 199.53043 202.004269 202.728430 242.557198
##  neval cld
##      5  a 
##      5   b

Further reading: you can set multiple keys with setkeyv as well.
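For example, a sketch of a two-column key set with setkeyv (the column names here are made up):

```r
library(data.table)

DT <- data.table(a = c("x", "x", "y"), b = c(1L, 2L, 1L), val = 1:3)
setkeyv(DT, c("a", "b"))  # key on both columns
key(DT)
## [1] "a" "b"
DT[list("x", 2L)]         # binary search on the (a, b) pair
##    a b val
## 1: x 2   2
```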

Joins

data.table comes with many kinds of joins, implemented through the merge.data.table function and also through the [ syntax. We'll focus on using merge.

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2)
##    x y z
## 1: a 1 5
## 2: a 2 5
## 3: b 3 6

Overview of joins

Here is a quick summary of SQL joins, applicable to data.table too.

(Source: http://www.codeproject.com)

A left join

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all.x = TRUE)
##      x y  z
## 1:   a 1  5
## 2:   a 2  5
## 3:   b 3  6
## 4: dt1 4 NA

A right join

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all.y = TRUE)
##      x  y z
## 1:   a  1 5
## 2:   a  2 5
## 3:   b  3 6
## 4: dt2 NA 7

An outer join

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all = TRUE)  ## outer join
##      x  y  z
## 1:   a  1  5
## 2:   a  2  5
## 3:   b  3  6
## 4: dt1  4 NA
## 5: dt2 NA  7

An inner join

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all = FALSE)  ## inner join
##    x y z
## 1: a 1 5
## 2: a 2 5
## 3: b 3 6

Speed example

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = do.call(paste, expand.grid(letters, letters, letters, letters)), 
    y = rnorm(26^4))
DT2 <- DT1[sample(1:nrow(DT1), 1e+05), ]
setnames(DT2, c("x", "z"))
DF1 <- as.data.frame(DT1)
DF2 <- as.data.frame(DT2)
setkey(DT1, x)
setkey(DT2, x)
microbenchmark(DT = merge(DT1, DT2), DF = merge.data.frame(DF1, DF2))
## Unit: nanoseconds
##  expr       min          lq         mean    median           uq        max neval cld
##    DT  43588481  47473113.5 5.960710e+07  50444820   60420617.5  166500058   100  a 
##    DF 864700111 947929331.0 1.068524e+09 997906140 1170151161.5 1944960095   100   b

Subset joins

We can also perform joins of two keyed data.tables using the [ operator. We perform right joins, so that e.g.

  • DT1[DT2]

is a right join of DT1 into DT2. These joins are typically a bit faster. Do note that the order of columns post-merge can be different, though.

Subset joins

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = do.call(paste, expand.grid(letters, letters, letters, letters)), 
    y = rnorm(26^4))
DT2 <- DT1[sample(1:nrow(DT1), 1e+05), ]
setnames(DT2, c("x", "z"))
setkey(DT1, x)
setkey(DT2, x)
tmp1 <- DT2[DT1]
setcolorder(tmp1, c("x", "y", "z"))
tmp2 <- merge(DT1, DT2, all.x = TRUE)
setcolorder(tmp2, c("x", "y", "z"))
identical(tmp1, tmp2)
## [1] TRUE

Subset joins can be faster

library(data.table)
library(microbenchmark)
DT1 <- data.table(x = do.call(paste, expand.grid(letters, letters, letters, letters)), 
    y = rnorm(26^4))
DT2 <- DT1[sample(1:nrow(DT1), 1e+05), ]
setnames(DT2, c("x", "z"))
setkey(DT1, x)
setkey(DT2, x)
microbenchmark(bracket = DT1[DT2], merge = merge(DT1, DT2, all.y = TRUE), times = 5)
## Unit: milliseconds
##     expr      min       lq     mean   median        uq       max neval cld
##  bracket 36.23743 37.09330  48.2612 44.13048  54.22463  69.62013     5  a 
##    merge 85.16531 91.30996 123.7399 97.66892 159.58336 184.97197     5   b

Subset joins and the j-expression

More importantly, they can be used together with the j expression, which can be very convenient.

DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15, key = "x")
DT2 <- data.table(x = 2L, y = 7, w = 1L, key = "x")
# 1) subset only essential/necessary cols
DT1[DT2, list(z)]
##     z
## 1: 12
# 2) create new col; i.y refers to DT2's y col
DT1[DT2, list(newcol = y > i.y)]
##    newcol
## 1:  FALSE
# 3) also assign by reference with `:=`
DT1[DT2, `:=`(newcol, z - w)]
##    x  y  z newcol
## 1: 1  6 11     NA
## 2: 2  7 12     11
## 3: 3  8 13     NA
## 4: 4  9 14     NA
## 5: 5 10 15     NA

data.table and SQL

We can understand the usage of [ as SQL statements. From FAQ 2.16:

data.table argument    SQL statement
i                      WHERE
j                      SELECT
:=                     UPDATE
by                     GROUP BY
i                      ORDER BY (in compound syntax)
i                      HAVING (in compound syntax)

Compound syntax refers to multiple subsetting calls, and generally isn't necessary until you really feel like a data.table expert:

DT[where,select|update,group by][having][order by][ ]...[ ]
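As a sketch, a compound query chaining where, select + group by, then order by (toy data):

```r
library(data.table)

DT <- data.table(x = 1:6, gp = c("a", "b", "a", "b", "a", "b"))
# WHERE x > 1, SELECT sum(x) GROUP BY gp, then ORDER BY total descending
DT[x > 1, list(total = sum(x)), by = gp][order(-total)]
##    gp total
## 1:  b    12
## 2:  a     8
```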

data.table and SQL - Joins

Here is a quick summary table of joins in data.table.

SQL           data.table
LEFT JOIN     x[y]
RIGHT JOIN    y[x]
INNER JOIN    x[y, nomatch=0]
OUTER JOIN    merge(x, y, all=TRUE)
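A quick sketch checking the x[y] forms from the table above against merge (the key values are made up):

```r
library(data.table)

x <- data.table(k = c("a", "b", "c"), vx = 1:3, key = "k")
y <- data.table(k = c("b", "c", "d"), vy = 4:6, key = "k")
x[y]               # keeps all rows of y
##    k vx vy
## 1: b  2  4
## 2: c  3  5
## 3: d NA  6
x[y, nomatch = 0]  # inner join: only keys present in both
##    k vx vy
## 1: b  2  4
## 2: c  3  5
```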

data.table and SQL

It's worth noting that I really mean it when I say that data.table is like an in-memory database. It will even perform some basic query optimization!

library(data.table)
options(datatable.verbose = TRUE)
DT <- data.table(x = 1:5, y = 1:5, z = 1:5, a = c("a", "a", "b", "b", "c"))
DT[, lapply(.SD, mean), by = a]
## Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
## lapply optimization changed j from 'lapply(.SD, mean)' to 'list(mean(x), mean(y), mean(z))'
## GForce optimized j to 'list(gmean(x), gmean(y), gmean(z))'
##    a   x   y   z
## 1: a 1.5 1.5 1.5
## 2: b 3.5 3.5 3.5
## 3: c 5.0 5.0 5.0
options(datatable.verbose = FALSE)

Some thoughts

The primary use of data.table is for efficient and elegant data manipulation including aggregation and joins.

library(data.table)
library(microbenchmark)
DT <- data.table(gp1 = sample(letters, 1e+06, TRUE), gp2 = sample(LETTERS, 1e+06, 
    TRUE), y = rnorm(1e+06))
microbenchmark(times = 5, DT = DT[, mean(y), by = list(gp1, gp2)], DF = with(DT, 
    tapply(y, paste(gp1, gp2), mean)))
## Unit: milliseconds
##  expr      min        lq      mean    median       uq      max neval cld
##    DT  25.2472  28.39406  34.69776  28.67093  41.2674  49.9092     5  a 
##    DF 320.3084 345.78784 413.33449 445.41037 471.8036 483.3623     5   b

Unlike "split-apply-combine" approaches such as plyr, data is never split in data.table! data.table applies the function to each subset recursively (in C for speed). This keeps the memory footprint low, which is essential for "big data".

Other interesting convenience functions

  • like
DT = data.table(Name = c("Mary", "George", "Martha"), Salary = c(2, 3, 4))
# Use regular expressions
DT[Name %like% "^Mar"]
##      Name Salary
## 1:   Mary      2
## 2: Martha      4
  • set* functions set, setattr, setnames, setcolorder, setkey, setkeyv
setcolorder(DT, c("Salary", "Name"))
DT
##    Salary   Name
## 1:      2   Mary
## 2:      3 George
## 3:      4 Martha
  • DT[, (myvar):=NULL] remove a column
DT[, `:=`(Name, NULL)]
##    Salary
## 1:      2
## 2:      3
## 3:      4

Listing all tables

With data.table you can always list the tables that you've created, which will also return basic information on these tables, including size, keys, number of rows, etc.

tables()
##      NAME        NROW NCOL MB COLS         KEY
## [1,] big_dt 1,000,000    3 20 x,y,z           
## [2,] dt             3    3  1 x,y,z           
## [3,] DT             3    2  1 Name,Salary     
## [4,] DT1            5    4  1 x,y,z,newcol x  
## [5,] DT2            1    3  1 x,y,w        x  
## [6,] tmp1     456,976    3 32 x,y,z        x  
## [7,] tmp2     456,976    3 32 x,y,z        x  
## Total: 88MB

Bonuses: fread

data.table also comes with fread, a file reader that is much, much faster than read.table or read.csv:

library(data.table)
library(microbenchmark)
big_df <- data.frame(x = rnorm(1e+06), y = rnorm(1e+06))
file <- tempfile()
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE, sep = "\t", 
    quote = FALSE)
microbenchmark(fread = fread(file), r.t = read.table(file, header = TRUE, sep = "\t"), 
    times = 1)
## Unit: milliseconds
##   expr       min        lq      mean    median        uq       max neval
##  fread  331.6005  331.6005  331.6005  331.6005  331.6005  331.6005     1
##    r.t 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770     1
unlink(file)

Bonuses: rbindlist

Use this function to rbind a list of data.frames, data.tables or lists:

library(data.table)
library(microbenchmark)
dfs <- replicate(100, data.frame(x = rnorm(10000), y = rnorm(10000)), simplify = FALSE)
all.equal(rbindlist(dfs), data.table(do.call(rbind, dfs)))
## [1] TRUE
microbenchmark(DT = rbindlist(dfs), DF = do.call(rbind, dfs), times = 5)
## Unit: milliseconds
##  expr       min        lq      mean     median         uq       max neval
##    DT  10.02123  10.46328   10.7023   10.62325   10.65706   11.7467     5
##    DF 787.99593 984.21012 1088.7448 1060.43489 1142.37993 1468.7029     5
##  cld
##   a 
##    b

Summary

To quote Matt Dowle

data.table builds on base R functionality to reduce 2 types of time :

  1. programming time (easier to write, read, debug and maintain)

  2. compute time

It has always been that way around, 1 before 2. The main benefit is the syntax: combining where, select|update and 'by' into one query without having to string along a sequence of isolated function calls. Reduced function calls. Reduced variable name repetition. Easier to understand.

Learning More

Databases and the Structured Query Language (SQL)

  • A database is an organized collection of datasets (tables).
  • A database management system (DBMS) is a software system designed to allow the definition, creation, querying, update, and administration of databases.
  • Well-known DBMSs include MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, etc.
  • Relational DBMSs (RDBMSs) store data in a set of related tables.
  • Most RDBMSs use some form of the Structured Query Language (SQL).

Why do we even need databases?

The Structured Query Language (SQL)

Although SQL is an ANSI (American National Standards Institute) standard, there are different flavors of the SQL language.

The data in an RDBMS is stored in database objects called tables.

A table is a collection of related data entries; it consists of columns and rows.

Here we will use SQLite, which is a self-contained relational database management system. In contrast to other database management systems (e.g. MySQL, PostgreSQL), SQLite is not a separate server process accessed by the client application; the database engine is embedded in the application itself.

Using RSQLite

Here we will make use of the Bioconductor project to load and use an SQLite database.

# You only need to run this once: install the package if it is missing.
source("http://bioconductor.org/biocLite.R")
if (!require("org.Hs.eg.db")) biocLite("org.Hs.eg.db")
# Now we can use the org.Hs.eg.db to load a database
library(org.Hs.eg.db)
# Create a connection
Hs_con <- org.Hs.eg_dbconn()

Using RSQLite

# List tables
head(dbListTables(Hs_con))
## [1] "accessions"            "alias"                 "chrlengths"           
## [4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
# Or using an SQLite command (NOTE: This is specific to SQLite)
head(dbGetQuery(Hs_con, "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"))
##                    name
## 1            accessions
## 2                 alias
## 3            chrlengths
## 4  chromosome_locations
## 5           chromosomes
## 6 cytogenetic_locations

Using RSQLite

# What columns are available?
dbListFields(Hs_con, "gene_info")
## [1] "_id"       "gene_name" "symbol"
dbListFields(Hs_con, "alias")
## [1] "_id"          "alias_symbol"
# Or using SQLite: dbGetQuery(Hs_con, "PRAGMA table_info('gene_info');")

Using RSQLite

gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1840419  98.3    4643607 248.0  4276538 228.4
## Vcells 15889273 121.3   47075374 359.2 47071241 359.2
alias <- dbGetQuery(Hs_con, "SELECT * FROM alias;")
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1943826 103.9    4643607 248.0  4276538 228.4
## Vcells 16187585 123.6   47075374 359.2 47071241 359.2
gene_info <- dbGetQuery(Hs_con, "SELECT * FROM gene_info;")
chromosomes <- dbGetQuery(Hs_con, "SELECT * FROM chromosomes;")

Using RSQLite

CD154_df <- dbGetQuery(Hs_con, "SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE a.alias_symbol LIKE 'CD154';")
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1989679 106.3    4643607 248.0  4276538 228.4
## Vcells 16674744 127.3   47075374 359.2 47071241 359.2
CD40LG_alias_df <- dbGetQuery(Hs_con, "SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE g.symbol LIKE 'CD40LG';")
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1989696 106.3    4643607 248.0  4276538 228.4
## Vcells 16674841 127.3   47075374 359.2 47071241 359.2

Some SQL Commands

SELECT

The SELECT statement is used to query the database and retrieve the data that match criteria you specify:

SELECT column1 [, column2, ...] FROM tablename WHERE condition

ORDER BY

The ORDER BY clause sorts the result by a column name in either ascending (ASC) or descending (DESC) order.

Some SQL Commands

JOIN

There are times when we need to collate data from two or more tables. As with data.tables, we can use LEFT/RIGHT/INNER JOINs.

GROUP BY

The GROUP BY clause was added to SQL so that aggregate functions could return a result grouped by column values.

SELECT col_name, function (col_name) FROM table_name GROUP BY col_name

A "GROUP BY" example

dbGetQuery(Hs_con, "SELECT c.chromosome, COUNT(g.gene_name) AS count FROM chromosomes c JOIN gene_info g ON g._id = c._id WHERE c.chromosome IN (1,2,3,4,'X') GROUP BY c.chromosome ORDER BY count;")
##   chromosome count
## 1          4  1917
## 2          X  2154
## 3          3  2549
## 4          2  3131
## 5          1  4439

Some more SQL commands

Some other SQL statements that might be of use to you:

CREATE TABLE

The CREATE TABLE statement is used to create a new table.

DELETE

The DELETE command can be used to remove records from a table.

DROP

To remove an entire table from the database use the DROP command.

CREATE VIEW

A view is a virtual table that is the result of an SQL SELECT statement. A view contains fields from one or more real tables in the database. This virtual table can then be queried as if it were a real table.
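A hedged sketch of some of these statements, run against a throwaway in-memory SQLite database (the table and column names are made up; dbExecute is DBI's function for statements that return no rows):

```r
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")  # temporary in-memory database
dbExecute(con, "CREATE TABLE genes (symbol TEXT, chromosome TEXT);")
dbExecute(con, "INSERT INTO genes VALUES ('CD40LG', 'X'), ('TP53', '17');")
dbExecute(con, "DELETE FROM genes WHERE chromosome = '17';")
res <- dbGetQuery(con, "SELECT * FROM genes;")
res
##   symbol chromosome
## 1 CD40LG          X
dbExecute(con, "DROP TABLE genes;")
dbDisconnect(con)
```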

Creating your own SQLite database in R

library(RSQLite)
db <- dbConnect(SQLite(), dbname = "./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, 
    header = TRUE, sep = "\t", overwrite = TRUE)
## [1] TRUE
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", 
    row.names = FALSE, header = TRUE, sep = "\t", overwrite = TRUE)
## [1] TRUE

Creating your own SQLite database in R

dbListFields(db, "hai")
##  [1] "RESULT_ID"                 "ARM_ACCESSION"            
##  [3] "BIOSAMPLE_ACCESSION"       "EXPSAMPLE_ACCESSION"      
##  [5] "EXPERIMENT_ACCESSION"      "STUDY_ACCESSION"          
##  [7] "STUDY_TIME_COLLECTED"      "STUDY_TIME_COLLECTED_UNIT"
##  [9] "SUBJECT_ACCESSION"         "VALUE_REPORTED"           
## [11] "VIRUS_STRAIN"              "WORKSPACE_ID"
dbListFields(db, "cohort")
## [1] "ARM_ACCESSION"             "DESCRIPTION"              
## [3] "NAME"                      "POPULATION_SELECTION_RULE"
## [5] "SORT_ORDER"                "STUDY_ACCESSION"          
## [7] "TYPE"                      "WORKSPACE_ID"

Creating your own SQLite database in R

res <- dbGetQuery(db, "SELECT STUDY_TIME_COLLECTED, cohort.DESCRIPTION, MAX(VALUE_REPORTED) AS max_value FROM hai JOIN cohort ON hai.ARM_ACCESSION = cohort.ARM_ACCESSION WHERE cohort.DESCRIPTION LIKE '%TIV%' GROUP BY BIOSAMPLE_ACCESSION;")
head(res)
##   STUDY_TIME_COLLECTED                      DESCRIPTION max_value
## 1                    0 Healthy adults given TIV vaccine        80
## 2                    0 Healthy adults given TIV vaccine        40
## 3                    0 Healthy adults given TIV vaccine       160
## 4                    0 Healthy adults given TIV vaccine       160
## 5                    0 Healthy adults given TIV vaccine         5
## 6                    0 Healthy adults given TIV vaccine        40

Using data.table to perform the same operations

# Read the tables using fread
library(data.table)
hai <- fread("./Data/SDY61/hai_result.txt")
cohort <- fread("./Data/SDY61/arm_or_cohort.txt")
## Set keys before joining
setkey(hai, "ARM_ACCESSION")
setkey(cohort, "ARM_ACCESSION")
## Inner join
dt_joined <- cohort[hai, nomatch = 0]
## Summarize values
head(dt_joined[DESCRIPTION %like% "TIV", list(max_value = max(VALUE_REPORTED)), by = "BIOSAMPLE_ACCESSION,STUDY_TIME_COLLECTED"])
##    BIOSAMPLE_ACCESSION STUDY_TIME_COLLECTED max_value
## 1:            BS586649                   28        40
## 2:            BS586655                   28       160
## 3:            BS586424                    0        40
## 4:            BS586408                    0         5
## 5:            BS586636                   28       640
## 6:            BS586059                   28      2560

The sqldf package

Sometimes it can be convenient to use SQL statements on data frames. This is exactly what the sqldf package does.

library(sqldf)
data(iris)
sqldf("select * from iris limit 5")
sqldf("select count(*) from iris")
sqldf("select Species, count(*) from iris group by Species")

The sqldf package can even provide increased speed over pure R operations.

dplyr

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R.

  • Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.

  • Provide blazing fast performance for in-memory data by writing key pieces in C++.

  • Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table, or a database.

dplyr is verbose!

dplyr implements the following verbs useful for data manipulation:

  • select(): focus on a subset of variables
  • filter(): focus on a subset of rows
  • mutate(): add new columns
  • summarise(): reduce each group to a smaller number of summary statistics
  • arrange(): re-order the rows
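Each verb takes a data frame as its first argument and returns a new one, so they compose naturally. A small illustration on the built-in iris data (the derived column and object names are just for this sketch):

```r
suppressMessages(library(dplyr))

# select(): keep a subset of variables
sel <- select(iris, Species, Sepal.Length)

# filter(): keep a subset of rows
flt <- filter(sel, Sepal.Length > 5)

# mutate(): add a derived column
mut <- mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width)

# arrange(): reorder rows, largest sepals first
srt <- arrange(iris, desc(Sepal.Length))

# summarise() + group_by(): one row of summary statistics per group
summarise(group_by(iris, Species), mean_sl = mean(Sepal.Length))
```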

dplyr with SQLite

suppressMessages(library(dplyr))
my_db <- src_sqlite("./Data/SDY61/SDY61.sqlite", create = TRUE)
hai_sql <- tbl(my_db, "hai")

Select

select(hai_sql, ARM_ACCESSION, BIOSAMPLE_ACCESSION)
## Source: sqlite 3.8.6 [./Data/SDY61/SDY61.sqlite]
## From: hai [390 x 2]
## 
##    ARM_ACCESSION BIOSAMPLE_ACCESSION
## 1         ARM457            BS586649
## 2         ARM456            BS586611
## 3         ARM457            BS586655
## 4         ARM456            BS586392
## 5         ARM456            BS586401
## 6         ARM457            BS586424
## 7         ARM457            BS586408
## 8         ARM457            BS586636
## 9         ARM457            BS586059
## 10        ARM456            BS586622
## ..           ...                 ...

Filter

filter(hai_sql, value_reported > 10)
## Source: sqlite 3.8.6 [./Data/SDY61/SDY61.sqlite]
## From: hai [178 x 12]
## Filter: value_reported > 10 
## 
##    RESULT_ID ARM_ACCESSION BIOSAMPLE_ACCESSION EXPSAMPLE_ACCESSION
## 1       1563        ARM457            BS586649            ES495456
## 2        592        ARM456            BS586611            ES495390
## 3       1569        ARM457            BS586655            ES495462
## 4       1646        ARM457            BS586424            ES495595
## 5       1662        ARM457            BS586636            ES495555
## 6       1721        ARM457            BS586059            ES495642
## 7        491        ARM456            BS586622            ES495289
## 8       1526        ARM457            BS586416            ES495475
## 9        495        ARM456            BS586626            ES495293
## 10      1635        ARM457            BS586413            ES495584
## ..       ...           ...                 ...                 ...
## Variables not shown: EXPERIMENT_ACCESSION (chr), STUDY_ACCESSION (chr),
##   STUDY_TIME_COLLECTED (int), STUDY_TIME_COLLECTED_UNIT (chr),
##   SUBJECT_ACCESSION (chr), VALUE_REPORTED (int), VIRUS_STRAIN (chr),
##   WORKSPACE_ID (int)

Summarize

summarize(group_by(hai_sql, SUBJECT_ACCESSION, STUDY_TIME_COLLECTED), mean(VALUE_REPORTED))
## Source: sqlite 3.8.6 [./Data/SDY61/SDY61.sqlite]
## From: <derived table> [?? x 3]
## Grouped by: SUBJECT_ACCESSION 
## 
##    SUBJECT_ACCESSION STUDY_TIME_COLLECTED mean(VALUE_REPORTED)
## 1          SUB112829                    0             33.33333
## 2          SUB112829                   28             40.00000
## 3          SUB112830                    0             43.33333
## 4          SUB112830                   28            270.00000
## 5          SUB112831                    0              5.00000
## 6          SUB112831                   28              5.00000
## 7          SUB112832                    0             14.16667
## 8          SUB112832                   28            141.66667
## 9          SUB112833                    0             81.66667
## 10         SUB112833                   28             81.66667
## ..               ...                  ...                  ...

dplyr on data.tables

summarize(group_by(hai, SUBJECT_ACCESSION, STUDY_TIME_COLLECTED), mean(VALUE_REPORTED))
## Source: local data table [120 x 3]
## Groups: SUBJECT_ACCESSION
## 
##    SUBJECT_ACCESSION STUDY_TIME_COLLECTED mean(VALUE_REPORTED)
## 1          SUB112829                    0             33.33333
## 2          SUB112829                   28             40.00000
## 3          SUB112830                    0             43.33333
## 4          SUB112830                   28            270.00000
## 5          SUB112831                    0              5.00000
## 6          SUB112831                   28              5.00000
## 7          SUB112832                    0             14.16667
## 8          SUB112832                   28            141.66667
## 9          SUB112833                    0             81.66667
## 10         SUB112833                   28             81.66667
## ..               ...                  ...                  ...

dplyr and magrittr

dplyr can use piping operations via the magrittr package, as follows,

library(magrittr)
hai %>% group_by(SUBJECT_ACCESSION, STUDY_TIME_COLLECTED) %>% summarize(hai_mean = mean(VALUE_REPORTED)) %>% 
    filter(STUDY_TIME_COLLECTED == 28)
## Source: local data table [60 x 3]
## Groups: SUBJECT_ACCESSION
## 
##    SUBJECT_ACCESSION STUDY_TIME_COLLECTED   hai_mean
## 1          SUB112829                   28  40.000000
## 2          SUB112830                   28 270.000000
## 3          SUB112831                   28   5.000000
## 4          SUB112832                   28 141.666667
## 5          SUB112833                   28  81.666667
## 6          SUB112834                   28  93.333333
## 7          SUB112835                   28 241.666667
## 8          SUB112836                   28   5.000000
## 9          SUB112837                   28 861.666667
## 10         SUB112838                   28   5.000000
## 11         SUB112839                   28 120.000000
## 12         SUB112840                   28  10.000000
## 13         SUB112841                   28 273.333333
## 14         SUB112842                   28 106.666667
## 15         SUB112843                   28 573.333333
## 16         SUB112844                   28  15.000000
## 17         SUB112845                   28 119.166667
## 18         SUB112846                   28  10.000000
## 19         SUB112847                   28 626.666667
## 20         SUB112848                   28 293.333333
## 21         SUB112849                   28  31.666667
## 22         SUB112850                   28   6.666667
## 23         SUB112851                   28   5.000000
## 24         SUB112852                   28 266.666667
## 25         SUB112853                   28 213.333333
## 26         SUB112854                   28   6.666667
## 27         SUB112855                   28 266.666667
## 28         SUB112856                   28 280.000000
## 29         SUB112857                   28 133.333333
## 30         SUB112858                   28   5.000000
## 31         SUB112859                   28   5.000000
## 32         SUB112860                   28  21.666667
## 33         SUB112861                   28 186.666667
## 34         SUB112862                   28  11.666667
## 35         SUB112863                   28  80.000000
## 36         SUB112864                   28  61.666667
## 37         SUB112865                   28  10.000000
## 38         SUB112866                   28  35.000000
## 39         SUB112867                   28   5.000000
## 40         SUB112868                   28  35.000000
## 41         SUB112869                   28  16.666667
## 42         SUB112870                   28   5.000000
## 43         SUB112871                   28  28.333333
## 44         SUB112872                   28  33.333333
## 45         SUB112873                   28   5.000000
## 46         SUB112874                   28 110.000000
## 47         SUB112875                   28  30.000000
## 48         SUB112876                   28  56.666667
## 49         SUB112877                   28  28.333333
## 50         SUB112878                   28 106.666667
## 51         SUB112879                   28  68.333333
## 52         SUB112880                   28  30.000000
## 53         SUB112881                   28   5.000000
## 54         SUB112882                   28  35.000000
## 55         SUB112883                   28  10.000000
## 56         SUB112884                   28  35.000000
## 57         SUB112885                   28   5.000000
## 58         SUB112886                   28   5.000000
## 59         SUB112887                   28  55.000000
## 60         SUB112888                   28  53.333333

Summary

  • R base data.frames are convenient but often not adapted to large dataset manipulation (e.g. genomics).

  • Thankfully, there are good alternatives. My recommendation is:

    • Use data.table for your day-to-day operations
    • When you have many tables and a complex schema, use SQLite.

Note: There are many other R packages for "big data", such as the bigmemory suite, biglm, ff, RNetCDF, rhdf5, etc.