Omics data is in the “p >> n” regime where there are fewer samples
than measurements per sample. This creates dual challenges in generating
realistic simulated data for the purposes of benchmarking. First, there
isn’t enough data to be able to compute a dependence structure (e.g., a
full-rank correlation matrix). Second, generating omics-scale data with
a specified correlation matrix is slow due to the typical
Here, we give a simple solution to both of these problems by using a low-rank correlation matrix to both approximate realistic dependencies in a real dataset and generate simulated data mimicking the real data. Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can have realistic forms like the negative binomial appropriate for omics datasets like RNA-seq read counts. Our implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions. This makes it particularly suited to RNA-seq data but also widely applicable.
Using this, simulating data that matches a given data
(with samples in
columns and measurements/features in rows) with each feature having a
negative binomial distribution is fast and simple:
library(dependentsimr)
head(read_counts) # An RNA-seq dataset
rs <- get_random_structure(list(counts=as.matrix(read_counts[,-1])), method="pca", rank=2, type="DESeq2")
simulated_data <- draw_from_multivariate_corr(rs, n_samples=5)$counts
head(simulated_data)
Finally, this also supports simultaneously generating multiple ‘modes’ of data, such as happens in multi-omics, where each node can have a distinct marginal distribution type. For example, proteomics might have normal margins and RNA-seq the DESeq2 margins. This captures cross-mode dependencies observed in the data as well as intra-mode.
The package is available from CRAN:
install.packages("dependentsimr")
Alternatively, you can install the latest development version of dependentsimr from GitHub with:
# install.packages("remotes")
remotes::install_github("tgbrooks/dependentsimr")
For an extended example of using this package to generate RNA-seq data, please see this vignette.
The methods described in our paper and implemented here are quite
general and are computationally efficient for genome-scale datasets. The
implementation is already largely flexibile enough to fit most omics
datasets, however, some additional developments may be needed to scale
to new data types. For example, single-cell RNA-seq is sometimes
modelled by using a zero-inflated negative binomial (ZINB), which is not
implemented here. Such a modification could be made in the
marginals_and_transform
and draw_from_multivariate_corr
functions
relatively easily, and likewise for any other marginal distributions
required. Moreover, raw single-cell datasets often have distinct
cell-types which means that cell-cell variance is not well captured by
the Gaussian copula approach: the solution to this is to separate out
cell-types and simulate them separately.