dependentsimr

Generating Correlated Data for Omics Simulation

Omics data is in the “p >> n” regime where there are fewer samples than measurements per sample. This creates dual challenges in generating realistic simulated data for the purposes of benchmarking. First, there isn’t enough data to be able to compute a dependence structure (e.g., a full-rank correlation matrix). Second, generating omics-scale data with a specified correlation matrix is slow due to the typical $O(p^3)$ nature of these algorithms. These often mean that simulators assume independence of the measurements, which does not reflect reality.

Here, we give a simple solution to both of these problems by using a low-rank correlation matrix to both approximate realistic dependencies in a real dataset and generate simulated data mimicking the real data. Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can have realistic forms like the negative binomial appropriate for omics datasets like RNA-seq read counts. Our implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions. This makes it particularly suited to RNA-seq data but also widely applicable.

Using this, simulating data that matches a given data (with samples in columns and measurements/features in rows) with each feature having a negative binomial distribution is fast and simple:

library(dependentsimr)
head(read_counts) # An RNA-seq dataset
rs <- get_random_structure(list(counts=as.matrix(read_counts[,-1])), method="pca", rank=2, type="DESeq2")
simulated_data <- draw_from_multivariate_corr(rs, n_samples=5)$counts
head(simulated_data)

Finally, this also supports simultaneously generating multiple ‘modes’ of data, such as happens in multi-omics, where each node can have a distinct marginal distribution type. For example, proteomics might have normal margins and RNA-seq the DESeq2 margins. This captures cross-mode dependencies observed in the data as well as intra-mode.

Installation

The package is available from CRAN:

install.packages("dependentsimr")

Alternatively, you can install the latest development version of dependentsimr from GitHub with:

# install.packages("remotes")
remotes::install_github("tgbrooks/dependentsimr")

Vignette

For an extended example of using this package to generate RNA-seq data, please see this vignette.

Future directions

The methods described in our paper and implemented here are quite general and are computationally efficient for genome-scale datasets. The implementation is already largely flexibile enough to fit most omics datasets, however, some additional developments may be needed to scale to new data types. For example, single-cell RNA-seq is sometimes modelled by using a zero-inflated negative binomial (ZINB), which is not implemented here. Such a modification could be made in the marginals_and_transform and draw_from_multivariate_corr functions relatively easily, and likewise for any other marginal distributions required. Moreover, raw single-cell datasets often have distinct cell-types which means that cell-cell variance is not well captured by the Gaussian copula approach: the solution to this is to separate out cell-types and simulate them separately.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
R		R
data		data
doc		doc
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
cran-comments.md		cran-comments.md
dependent_sim.Rproj		dependent_sim.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Uh oh!

Repository files navigation

dependentsimr

Generating Correlated Data for Omics Simulation

Installation

Vignette

Future directions

About

Licenses found

Uh oh!

Releases

Packages

Languages

License

Licenses found

tgbrooks/dependentsimr

Folders and files

Latest commit

History

Repository files navigation

dependentsimr

Generating Correlated Data for Omics Simulation

Installation

Vignette

Future directions

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages