TOPMed Analysis Pipeline — WDL Version

This project is a Workflow Description Language (WDL) implementation of several components of the University of Washington TOPMed pipeline, deliberately written to closely mimic the CWL version of the UW Pipeline. In other words, this is a WDL that mimics a CWL that mimics a Python pipeline. All three pipelines use the same underlying R scripts, which do most of the actual analysis, so their results are directly comparable. We have also used checker workflows to verify that results are scientifically equivalent.

Features

This pipeline is very similar to the CWL version; the main differences between the two are documented, and testing indicates they are functionally equivalent, to the point that files generated by the CWL are used as truth files for the WDL
Because it runs in a Docker container, it has no external dependencies beyond the usual setup required for WDL
Contains multiple checker workflows for validating sets of known inputs and expected outputs
Open-access sample data is provided, based upon sample data provided by UWGAC, itself based upon 1000 Genomes data
Autoscaling of the executor's disk size based on the size of input files, with the option for the user to add more storage on top of that
Support for preemptible VMs on Google backends
Documentation of inputs, how each workflow works, and WDL-specific workarounds

Usage

These workflows are tested on both Terra and a local installation of Cromwell. Example files are provided in test-data-and-truths and in gs://topmed_workflow_testing/UWGAC_WDL/.

Essentially all workflows that take in chromosome-level files share the same filename requirements. For these files, the chromosome must be included in the filename in the format chr##, where ## is the name of the chromosome (1-24, X, or Y). The chromosome can appear anywhere in the filename, provided it follows this format. For instance, data_subset_chr1.gds, data_chr1_subset.gds, and chr1_data_subset.gds are all valid names, while data_chromosome1_subset.gds and data_subset_c1.gds are not. Note that the association aggregate, LD prune, and null model workflows additionally require more than one input GDS file (i.e., input at least two files, such as chr1 and chr2).
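To check filenames before launching a workflow, the naming rule above can be sketched in Python. This is only an illustration of the rule as described here, not the parsing code the pipeline itself uses. The function names `chromosome_from_filename` and `check_gds_inputs` are hypothetical, and the sketch assumes the match is case-sensitive, that numbers have no leading zero, and that the chromosome token is not immediately followed by another digit.

```python
import re

# Assumption: "chr" followed by 1-24, X, or Y, with no leading zero and no
# trailing digit (so "chr100" is not read as chromosome 10).
CHR_TOKEN = re.compile(r"chr([1-9][0-9]?|X|Y)(?![0-9])")


def chromosome_from_filename(filename):
    """Return the chromosome label found in filename, or None if there is none."""
    for match in CHR_TOKEN.finditer(filename):
        label = match.group(1)
        if label in ("X", "Y") or 1 <= int(label) <= 24:
            return label
    return None


def check_gds_inputs(filenames, require_multiple=False):
    """Raise ValueError if any name lacks a chromosome token.

    Set require_multiple=True for workflows that need more than one
    input GDS file (association aggregate, LD prune, null model).
    """
    bad = [name for name in filenames if chromosome_from_filename(name) is None]
    if bad:
        raise ValueError("no chr## token in: " + ", ".join(bad))
    if require_multiple and len(filenames) < 2:
        raise ValueError("this workflow needs more than one input GDS file")
```

For example, `check_gds_inputs(["data_chr1.gds", "data_chr2.gds"], require_multiple=True)` passes silently, while a list containing `data_subset_c1.gds` raises a `ValueError`.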

For more information on specific runtime attributes for specific tasks, see the further reading section. The default runtime attributes provided in these pipelines were based on the provided test data, which is probably much smaller than what you will be using. With that in mind, if you are using your own data, please be sure to adjust your runtime attributes appropriately.

Running on Terra (recommended)

For Terra users, it is recommended to import via Dockstore. Importing the correct JSON file for your workflow at the workflow field entry page will fill in test data and recommended runtime attributes for said test data. For example, load vcf-to-gds-terra.json for vcf-to-gds.wdl.

Running on your local machine

Much preliminary testing and development of these pipelines was done by running Cromwell in "local mode," but we do not recommend this approach for actual analysis. Cromwell does not manage resources well on local executions. As a result, these pipelines (LD pruning especially) may have their processes killed by your OS and/or lock up Docker, even when running on downsampled data. These issues can generally be avoided by changing the concurrent job limit in your Cromwell configuration. With this limit set, all of the sample data in this repo should run on any Cromwell-and-Docker compatible machine. See instructions here for how to set the concurrent job limit in the Dockstore CLI.
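As a rough illustration of the concurrent job limit, the Python sketch below renders a small Cromwell configuration file that caps the default Local backend. It assumes you run Cromwell directly with its built-in backend named `Local`; the helper names and the output filename `cromwell.local.conf` are hypothetical, and a limit of 1 is only a cautious starting point, not a value recommended by this repo.

```python
from pathlib import Path


def render_local_backend_conf(concurrent_job_limit):
    """Return HOCON text that caps concurrent jobs on Cromwell's Local backend."""
    if concurrent_job_limit < 1:
        raise ValueError("concurrent_job_limit must be at least 1")
    lines = [
        "# Start from the defaults bundled with Cromwell.",
        'include required(classpath("application"))',
        "",
        "# Cap how many jobs the Local backend runs at the same time.",
        "backend.providers.Local.config.concurrent-job-limit = "
        + str(concurrent_job_limit),
        "",
    ]
    return "\n".join(lines)


def write_local_backend_conf(path, concurrent_job_limit=1):
    """Write the rendered configuration to path and return the path."""
    target = Path(path)
    target.write_text(render_local_backend_conf(concurrent_job_limit))
    return target
```

After calling `write_local_backend_conf("cromwell.local.conf")`, Cromwell can be pointed at the file with `java -Dconfig.file=cromwell.local.conf -jar <your Cromwell jar> run <workflow>.wdl --inputs <inputs>.json`, where the bracketed names are placeholders for your own files.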

Running on an HPC

These workflows have not been extensively tested in an HPC environment, but provided your HPC supports Cromwell and Docker, they should work as expected. You may wish to run the checker workflows before doing actual analysis to ensure everything is running smoothly.
