Skip to content

tgbrooks/dependentsimr

Repository files navigation

dependentsimr

Generating Correlated Data for Omics Simulation

Omics data is in the “p >> n” regime where there are fewer samples than measurements per sample. This creates dual challenges in generating realistic simulated data for the purposes of benchmarking. First, there isn’t enough data to be able to compute a dependence structure (e.g., a full-rank correlation matrix). Second, generating omics-scale data with a specified correlation matrix is slow due to the typical $O(p^3)$ nature of these algorithms. These often mean that simulators assume independence of the measurements, which does not reflect reality.

Here, we give a simple solution to both of these problems by using a low-rank correlation matrix to both approximate realistic dependencies in a real dataset and generate simulated data mimicking the real data. Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can have realistic forms like the negative binomial appropriate for omics datasets like RNA-seq read counts. Our implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions. This makes it particularly suited to RNA-seq data but also widely applicable.

Using this, simulating data that matches a given data (with samples in columns and measurements/features in rows) with each feature having a negative binomial distribution is fast and simple:

library(dependentsimr)
head(read_counts) # An RNA-seq dataset
rs <- get_random_structure(list(counts=as.matrix(read_counts[,-1])), method="pca", rank=2, type="DESeq2")
simulated_data <- draw_from_multivariate_corr(rs, n_samples=5)$counts
head(simulated_data)

Finally, this also supports simultaneously generating multiple ‘modes’ of data, such as happens in multi-omics, where each node can have a distinct marginal distribution type. For example, proteomics might have normal margins and RNA-seq the DESeq2 margins. This captures cross-mode dependencies observed in the data as well as intra-mode.

Installation

The package is available from CRAN:

install.packages("dependentsimr")

Alternatively, you can install the latest development version of dependentsimr from GitHub with:

# install.packages("remotes")
remotes::install_github("tgbrooks/dependentsimr")

Vignette

For an extended example of using this package to generate RNA-seq data, please see this vignette.

Future directions

The methods described in our paper and implemented here are quite general and are computationally efficient for genome-scale datasets. The implementation is already largely flexibile enough to fit most omics datasets, however, some additional developments may be needed to scale to new data types. For example, single-cell RNA-seq is sometimes modelled by using a zero-inflated negative binomial (ZINB), which is not implemented here. Such a modification could be made in the marginals_and_transform and draw_from_multivariate_corr functions relatively easily, and likewise for any other marginal distributions required. Moreover, raw single-cell datasets often have distinct cell-types which means that cell-cell variance is not well captured by the Gaussian copula approach: the solution to this is to separate out cell-types and simulate them separately.

About

Generating Correlated Data for Omics Simulation

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages