
TOPMed Workflows

About

The original pipelines were assembled and written by Hyun Min Kang ([email protected]) and Adrian Tan ([email protected]) at the Abecasis Lab at the University of Michigan.

See the variant calling pipeline and alignment pipeline repositories.

Installing dependencies on your local system

1. Cloud SDK (gcloud, gsutil)

If you are on Debian / Ubuntu, follow the instructions on Cloud SDK. After you execute gcloud init, the installer asks you to log in; respond with Y, head to the provided URL, copy the code, and paste it at the prompt. After that it will ask you which cloud project you want to use, so enter the GCP Project ID. I picked us-west1-b as the default zone.
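If you prefer to set these values non-interactively, the same choices can be made with gcloud config (the project ID below is a placeholder):

gcloud config set project <your-gcp-project-id>
gcloud config set compute/zone us-west1-b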

Configuration and credentials file

# Add the Cloud SDK repository as an apt package source
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud public key and install the SDK
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
# Log in with your Google account
gcloud auth login

After that, run gcloud auth application-default --help and follow the instructions. Briefly, run

gcloud iam service-accounts create <pick-a-username>
gcloud iam service-accounts keys create key.json --iam-account=<the-username-you-just-picked>@<your-gcp-project-id>.iam.gserviceaccount.com

That should print something like

created key [<a long hexadecimal key id>] of type [json] as [key.json] for [<username-you-picked>@<your-gcp-project-id>.iam.gserviceaccount.com]

You can check in the Google Cloud Platform console under IAM & Admin > Service Accounts. The account you just created should be in the list.
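The new service account will also need permission to read the buckets that hold your input data. As a minimal sketch (the storage.objectViewer role is an assumption; grant whatever roles your project actually requires):

gcloud projects add-iam-policy-binding <your-gcp-project-id> \
  --member=serviceAccount:<username-you-picked>@<your-gcp-project-id>.iam.gserviceaccount.com \
  --role=roles/storage.objectViewer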

Next create an environment variable that points to the file key.json:

export GOOGLE_APPLICATION_CREDENTIALS=key.json

Providing credentials to your application

To run workflows on data stored in Google Cloud Storage you need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the credentials file (key.json above).
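As a sketch, assuming key.json sits in your home directory, the following sets the variable for client libraries (such as Cromwell's Google backend) and, optionally, makes gcloud and gsutil use the same service account:

# Point client libraries at the service-account key
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/key.json"
# Optionally authenticate gcloud/gsutil as the same service account
gcloud auth activate-service-account --key-file="$HOME/key.json"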

2. Broad's execution engine cromwell

cromwell is a Java executable and requires a Java Runtime Environment. Follow the instructions here for a complete installation.
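As a sketch, a release jar can be fetched from Cromwell's GitHub releases page and sanity-checked with Java (version 31 is shown only because the checker workflows below were tested with it):

# Download a Cromwell release jar into ~/bin
wget -P ~/bin https://github.com/broadinstitute/cromwell/releases/download/31/cromwell-31.jar
# Confirm Java can run it
java -jar ~/bin/cromwell-31.jar --version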

3. Dockstore

For Dockstore to run you need to install the Java Runtime Environment. Find installation instructions for Dockstore here.
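Once both are installed, a quick sanity check (the output will vary with your versions):

java -version
dockstore --version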

Running workflows

Provisioning reference files

To copy the contents of a Google Cloud Storage bucket to your local system (or a VM), billing the request to your project (the -u flag is needed for requester-pays buckets), use

gsutil -u [PROJECT_ID] cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
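For example, to pull a reference file into a local references/ directory (the project, bucket, and object names here are placeholders):

gsutil -u my-gcp-project cp gs://my-reference-bucket/hs38DH.fa ./references/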

Checker workflows

A WDL and a JSON file for testing checker workflows are in the test_data directory. Adjust all paths in the JSON file to the paths on your system before running the checker. The workflows have been tested with cromwell-31.jar. To run the checker workflow for the WDL aligner, navigate to the respective directory (it usually has checker in its name) and run

java -Dconfig.file=<location_to_file> -jar ~/bin/<cromwell_version>.jar run <checker-workflow>.wdl -i <checker-workflow>.json
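A concrete invocation might look like the following; the configuration and workflow file names are illustrative, not the actual names in this repository:

java -Dconfig.file=google.conf -jar ~/bin/cromwell-31.jar run topmed_aligner_checker.wdl -i topmed_aligner_checker.json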

Cost estimates for Terra

Please keep in mind that your costs may vary depending on how your data is formatted and what parameters you use. In addition, if you are using preemptible VMs, there is an element of randomness: a preemptible instance may be stopped by Google at any time, causing an in-progress task to restart.

Aligner (WDL)

When running the aligner workflow with its default settings on 10 full-size CRAMs from the PharmaHD study imported from Gen3, the total cost reported by Terra was $80.38. The most expensive of those ten files cost $10.82 and the least expensive cost $5.74.

Aligner Checker (WDL)

Because the aligner checker runs the aligner and then simply performs an md5sum, the cost of the aligner checker will be about the same as that of the aligner.