BenjaminATaylor/Zuzu


ZUZU is a work in progress!

Zuzu schema

After a user provides a gene expression dataset, Zuzu operates on that dataset as follows. Except where the output of one step is a necessary input for a later step, the steps run in parallel, allowing the overall pipeline to run very quickly in a cluster environment.

  1. Differentially-expressed genes (DEGs) between a pre-specified set of conditions are identified using every analysis method individually. Notably, if the user wishes simply to identify DEGs for their dataset, either from a single method or as a consensus across multiple methods, this step alone streamlines that process to a single command.
  2. The original input dataset is fully permuted, with expression values for each gene shuffled randomly among samples. This process approximates a group of genes with the same overall variance and distribution of expression as in the original dataset, but with no true signal of phenotype. Each differential expression analysis method is applied to this permuted dataset and the resulting set of DEGs compared to that generated in step 1, to quantify the degree to which each method generates exaggerated false positives of the kind described in Li et al. (2022). This step is repeated a number of times specified by the user in order to account for stochastic effects of permutation.
  3. The original input dataset is subjected to a quasi-permutation procedure in order to assess each method's ability to identify true positives. In brief, differential expression analysis for each method is re-run with a very conservative threshold for DEG identification, permitting the generation of a set of very high-confidence DEGs with each method. A random half of these 'true' DEGs are retained, while all other genes are permuted across samples as in Step 2. This generates a new dataset consisting of a small number of high-confidence 'true' DEGs and a larger number of permuted genes that should exhibit no phenotypic signal. DEGs for this quasi-permuted dataset are generated once again through the given method, and the resulting set of DEGs is compared against the known 'true' DEGs. These data are then used to calculate three key benchmarks for the given method: Power, the proportion of true positives correctly classified by the method; False Discovery Proportion (FDP), the proportion of false positives among all DEGs identified by the given method; and Receiver Operating Characteristic Area-Under-Curve (ROC AUC), an aggregate measure of classification performance that balances the true positive rate against the false positive rate. These metrics can also be calculated for each method at smaller sample sizes, allowing us to assess whether certain methods are better able to accommodate small vs large datasets. As in Step 2, this whole step is performed repeatedly to account for stochasticity in permutation and choice of DEGs.
  4. The original input dataset is used as a template to generate a synthetic expression dataset with the same number of individuals and the same approximate sequencing depth; this step is achieved using the compcodeR package in R. Because this is a wholly synthetic dataset, true DEGs are known a priori and the metrics described in Step 3 can be calculated with more confidence and precision than is possible for a nonsynthetic dataset. In addition to calculating these metrics for each method across a range of sample sizes as in Step 3, the synthetic dataset can be altered to increase or decrease its read depth, allowing the user to assess whether particular methods perform well with more or less sparse data.
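
The permutation in Step 2, the quasi-permutation in Step 3, and the Power and FDP benchmarks can be illustrated with a minimal Python sketch. This is not Zuzu's own code: the function names (permute_genes, power_and_fdp), the dictionary layout for the expression data, and the toy gene names are all assumptions made for illustration, and the DEG calls are supplied by hand rather than produced by a real differential expression method.

```python
import random


def permute_genes(expr, keep=frozenset(), seed=None):
    """Shuffle each gene's expression values among samples.

    expr maps a gene name to a list of per-sample expression values
    (a hypothetical layout chosen for this sketch).
    Genes named in `keep` are left untouched, as for the retained
    'true' DEGs in Step 3; with an empty `keep` every gene is
    shuffled, as in the full permutation of Step 2.
    """
    rng = random.Random(seed)
    permuted = {}
    for gene, values in expr.items():
        if gene in keep:
            permuted[gene] = list(values)
        else:
            # sample() with k == len(values) returns a shuffled copy
            permuted[gene] = rng.sample(values, k=len(values))
    return permuted


def power_and_fdp(called, truth):
    """Return (power, FDP) for one set of DEG calls.

    power: fraction of the known 'true' DEGs that were called.
    FDP:   fraction of the called DEGs that are not 'true' DEGs.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    power = true_pos / len(truth) if truth else 0.0
    fdp = (len(called) - true_pos) / len(called) if called else 0.0
    return power, fdp


# Toy data: three genes measured in four samples.
expr = {"g1": [1, 2, 3, 4], "g2": [5, 6, 7, 8], "g3": [9, 10, 11, 12]}

# Quasi-permutation: g1 is retained as a 'true' DEG, the rest are shuffled.
quasi = permute_genes(expr, keep={"g1"}, seed=0)

# Hand-written calls standing in for the output of a DE method.
power, fdp = power_and_fdp(called={"g1", "g3"}, truth={"g1", "g2"})
```

In this toy case one of the two 'true' DEGs is recovered and one of the two calls is false, so both power and FDP come out at 0.5. Shuffling within each gene preserves that gene's set of values while breaking any association with sample labels, which is the property both permutation steps rely on.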

About

A differential gene expression benchmarking pipeline
