This project is a Workflow Description Language (WDL) implementation of several components of the University of Washington TOPMed pipeline, deliberately written to closely mimic the CWL version of the UW Pipeline. In other words, this is a WDL that mimics a CWL that mimics a Python pipeline. All three pipelines use the same underlying R scripts, which do most of the actual analysis, so their results are directly comparable. We have also used checker workflows to verify that results are scientifically equivalent.
- This pipeline is very similar to the CWL version, and while the main differences between the two are documented, testing indicates they are functionally equivalent -- so much so that files generated by the CWL are used as truth files for the WDL
- Because it runs in a Docker container, it has no external dependencies beyond the usual setup required for WDL
- Contains multiple checker workflows for validating sets of known inputs and expected outputs
- Open-access sample data is provided, based upon sample data provided by UWGAC, itself based upon 1000 Genomes data
- Autoscaling of executor's disk size based upon the size of input files, with the option for the user to add more storage on top of that
- Support for preemptible VMs on Google backends
- Documentation of inputs, how each workflow works, and WDL-specific workarounds
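As a rough illustration of the disk autoscaling mentioned in the list above, the idea can be sketched in Python. This is not the formula used by the workflows themselves (they compute disk size inside WDL); the function name, the `addl_disk_gb` parameter, and the round-up-to-whole-GB rule are assumptions made for this sketch.

```python
import math

BYTES_PER_GB = 1024 ** 3


def estimate_disk_gb(input_sizes_bytes, addl_disk_gb=0):
    """Estimate a disk request in whole GB (hypothetical helper).

    input_sizes_bytes: sizes of the task's input files, in bytes.
    addl_disk_gb: extra storage the user asks for on top of the estimate.
    """
    if addl_disk_gb < 0:
        raise ValueError("addl_disk_gb must not be negative")
    total_bytes = sum(input_sizes_bytes)
    # Round up so the request is never smaller than the inputs themselves.
    return math.ceil(total_bytes / BYTES_PER_GB) + addl_disk_gb


# Two inputs totalling 1.5 GB, plus 5 GB of user-requested headroom.
print(estimate_disk_gb([BYTES_PER_GB, BYTES_PER_GB // 2], addl_disk_gb=5))
```

Tasks that write outputs larger than their inputs are the usual reason to raise the additional-storage value.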
These workflows are tested on both Terra and a local installation of Cromwell. Example files are provided in test-data-and-truths and in gs://topmed_workflow_testing/UWGAC_WDL/.
Essentially all workflows that take in chromosome-level files share filename requirements. For these files, the chromosome must be included in the filename in the format chr##, where ## is the name of the chromosome (1-24 or X, Y). The chromosome can appear anywhere in the filename, provided it follows this format. For instance, data_subset_chr1.gds, data_chr1_subset.gds, and chr1_data_subset.gds are all valid names, while data_chromosome1_subset.gds and data_subset_c1.gds are not. Note that the association aggregate, LD prune, and null model workflows additionally require more than one input GDS file (i.e., input at least chr1 and chr2).
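To check filenames against this convention before launching a workflow, a small Python sketch such as the following may help. The regular expression is an assumption based only on the rule stated above; the workflows' own parsing may differ, for example in case sensitivity. `find_chromosome` is a hypothetical helper, not part of this repository.

```python
import re

# Assumed pattern: "chr" followed by 1-24, X, or Y, with no further digit
# directly after it (so "chr25" is rejected rather than read as "chr2").
CHR_PATTERN = re.compile(r"chr(2[0-4]|1[0-9]|[1-9]|X|Y)(?![0-9])")


def find_chromosome(filename):
    """Return the chromosome name found in filename, or None if absent."""
    match = CHR_PATTERN.search(filename)
    return match.group(1) if match else None


for name in [
    "data_subset_chr1.gds",
    "data_chr1_subset.gds",
    "chr1_data_subset.gds",
    "data_chromosome1_subset.gds",
    "data_subset_c1.gds",
]:
    status = "valid" if find_chromosome(name) else "not valid"
    print(f"{name}: {status}")
```

Running the same check over every input GDS file and counting the distinct chromosomes found is a quick way to confirm that more than one chromosome is present for the workflows that require it.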
For more information on specific runtime attributes for specific tasks, see the further reading section. The default runtime attributes provided in these pipelines were based on the provided test data, which is probably much smaller than what you will be using. With that in mind, if you are using your own data, please be sure to adjust your runtime attributes appropriately.
For Terra users, it is recommended to import via Dockstore. Importing the correct JSON file for your workflow at the workflow field entry page will fill in test data and recommended runtime attributes for said test data. For example, load vcf-to-gds-terra.json for vcf-to-gds.wdl.
Much preliminary testing and development of these pipelines was done by running Cromwell in "local mode," but we do not recommend this approach for doing actual analysis. Cromwell does not manage resources well on local executions. As a result, these pipelines (LD pruning especially) may get their processes killed by your OS and/or lock up Docker, even when running on downsampled data. These issues can generally be avoided by changing the concurrent job limit in your Cromwell configuration. With this limit set, all of the sample data in this repo should run on any Cromwell-and-Docker compatible machine. See instructions here for how to set the concurrent job limit in the Dockstore CLI.
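For reference, the concurrent job limit is a per-backend setting in Cromwell's configuration file. The Python sketch below only renders the text of a minimal override file. It assumes the default backend is named `Local` and that a limit of 1 suits your machine; `render_concurrency_override` is a hypothetical helper. How the resulting file is passed to Cromwell depends on your setup.

```python
def render_concurrency_override(limit=1, backend="Local"):
    """Return the text of a minimal Cromwell config override (a sketch).

    limit: maximum number of jobs Cromwell may run at once on this backend.
    backend: backend provider name; "Local" is assumed here.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        # Start from Cromwell's bundled defaults, then override one key.
        'include required(classpath("application"))\n'
        f"backend.providers.{backend}.config.concurrent-job-limit = {limit}\n"
    )


print(render_concurrency_override(limit=1))
```

A lower limit trades speed for stability. A limit of 1 runs tasks strictly one at a time, which is the most conservative choice on a laptop.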
These workflows have not been extensively tested in an HPC environment, but provided your HPC supports Cromwell and Docker, they should work as expected. You may wish to run the checker workflows before doing actual analysis to ensure everything is running smoothly.
general notes
- documentation on checker workflows
- documentation on CWL-WDL differences for users
- documentation on CWL-WDL differences for advanced users/devs
workflow-specific
- Association testing -- aggregate: assoc-aggregate
- Kinship: KING IBDSEG
- Linkage disequilibrium pruning: ld-pruning
- Null model generation: null-model
- pc-air
- pc-relate
- VCF to GDS file conversion: vcf-to-gds
Ash O'Farrell ([email protected])