GitHub - BenjaminATaylor/Zuzu: A differential gene expression benchmarking pipeline

ZUZU is a work in progress!

After a user provides a gene expression dataset, Zuzu operates on that dataset as follows. Note that, except where the output of one step is a necessary input for a later step, the steps run in parallel, allowing the overall pipeline to run quickly in a cluster environment.

1. Differentially expressed genes (DEGs) between a pre-specified set of conditions are identified using every analysis method individually. Notably, if the user simply wishes to identify DEGs for their dataset, either from a single method or as a consensus across multiple methods, this step alone reduces that process to a single command.
2. The original input dataset is fully permuted, with expression values for each gene shuffled randomly among samples. This process approximates a group of genes with the same overall variance and distribution of expression as in the original dataset, but with no true signal of phenotype. Each differential expression analysis method is applied to this permuted dataset and the resulting set of DEGs is compared to that generated in Step 1, to quantify the degree to which each method generates exaggerated false positives of the kind described in Li et al. (2022). This step is repeated a user-specified number of times to account for the stochastic effects of permutation.
3. The original input dataset is subjected to a quasi-permutation procedure in order to assess each method's ability to identify true positives. In brief, differential expression analysis for each method is re-run with a very conservative threshold for DEG identification, producing a set of very high-confidence DEGs for each method. A random half of these 'true' DEGs is retained unchanged, while all other genes are permuted across samples as in Step 2. This generates a new dataset consisting of a small number of high-confidence 'true' DEGs and a larger number of permuted genes that should exhibit no phenotypic signal. DEGs for this quasi-permuted dataset are identified once again using the given method, and the resulting set of DEGs is compared against the known 'true' DEGs. These data are then used to calculate three key benchmarks for the given method: Power, the proportion of 'true' DEGs correctly identified by the method; False Discovery Proportion (FDP), the proportion of false positives among all DEGs identified by the method; and Receiver Operating Characteristic Area-Under-Curve (ROC AUC), an aggregate measure of classification performance that summarizes the trade-off between true positive rate and false positive rate across classification thresholds. These metrics can also be calculated for each method at smaller sample sizes, making it possible to assess whether certain methods are better suited to small or large datasets. As in Step 2, this whole step is performed repeatedly to account for stochasticity in permutation and in the choice of retained DEGs.
4. The original input dataset is used as a template to generate a synthetic expression dataset with the same number of individuals and approximately the same sequencing depth; this step is achieved using the compcodeR package in R. Because this is a wholly synthetic dataset, the true DEGs are known by construction, and the metrics described in Step 3 can be calculated with more confidence and precision than is possible for a non-synthetic dataset. In addition to calculating these metrics for each method across a range of sample sizes as in Step 3, the synthetic dataset can be altered to increase or decrease its read depth, allowing the user to assess whether particular methods perform well with more or less sparse data.
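
The permutation and quasi-permutation ideas in Steps 2 and 3 can be illustrated with a minimal Python sketch. This is not Zuzu's own code: the function names (`permute_genes`, `quasi_permute`, `power_and_fdp`) are hypothetical, the count matrix is assumed to be a genes-by-samples NumPy array, and the DEG calls are assumed to arrive as boolean masks produced by whichever method is being benchmarked. ROC AUC is omitted because it additionally requires per-gene scores rather than binary calls.

```python
import numpy as np


def permute_genes(counts, rng, keep=None):
    """Shuffle each gene's values independently across samples.

    counts: 2-D array, rows = genes, columns = samples (assumed layout).
    keep:   optional boolean mask of genes (rows) to leave untouched.
    """
    permuted = counts.copy()
    for i in range(counts.shape[0]):
        if keep is not None and keep[i]:
            continue  # retained 'true' DEG: leave its sample order intact
        permuted[i] = rng.permutation(counts[i])
    return permuted


def quasi_permute(counts, high_conf, rng):
    """Keep a random half of the high-confidence DEGs, permute every other gene.

    high_conf: boolean mask of genes called at a very conservative threshold.
    Returns the quasi-permuted matrix and the mask of retained 'true' DEGs.
    """
    candidates = np.flatnonzero(high_conf)
    kept = rng.choice(candidates, size=len(candidates) // 2, replace=False)
    truth = np.zeros(counts.shape[0], dtype=bool)
    truth[kept] = True
    return permute_genes(counts, rng, keep=truth), truth


def power_and_fdp(called, truth):
    """Power = TP / (number of true DEGs); FDP = FP / (number of called DEGs)."""
    called = np.asarray(called, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    true_pos = np.sum(called & truth)
    n_true, n_called = truth.sum(), called.sum()
    power = true_pos / n_true if n_true else float("nan")
    fdp = (n_called - true_pos) / n_called if n_called else 0.0
    return power, fdp
```

In use, a method's DEG calls on the output of `permute_genes(counts, rng)` estimate how many positives it reports when no phenotypic signal exists, while its calls on the output of `quasi_permute` can be passed to `power_and_fdp` together with the returned `truth` mask. Repeating either procedure with different random seeds corresponds to the repetition described in Steps 2 and 3.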

