# Tutorial for Stanford SCG cluster

All test samples and genome data are shared on the Stanford SCG cluster, which runs SLURM. You don't have to download any data to test our pipeline there.

  1. SSH to SCG's login node.

      $ ssh login.scg.stanford.edu
    
  2. Clone this pipeline's repository and move into its directory.

      $ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
      $ cd atac-seq-pipeline
    
  3. Download Cromwell.

      $ wget https://github.com/broadinstitute/cromwell/releases/download/34/cromwell-34.jar
      $ chmod +rx cromwell-34.jar
    
  4. Set your SLURM account in workflow_opts/scg.json; the other runtime attributes in that file are for Singularity and can be ignored here. For a command-line way to set the account, see the sketch after this list.

      {
          "default_runtime_attributes" : {
              "slurm_account" : "YOUR_SLURM_ACCOUNT"
          }
      }
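
If you prefer to set the account from the command line rather than in an editor, a one-liner like the following works. This is only a convenience sketch: it assumes jq is available on the login node, and YOUR_SLURM_ACCOUNT is a placeholder.

      $ # Sketch: write your SLURM account into workflow_opts/scg.json (jq is assumed to be installed).
      $ jq '.default_runtime_attributes.slurm_account = "YOUR_SLURM_ACCOUNT"' workflow_opts/scg.json > scg.json.tmp
      $ mv scg.json.tmp workflow_opts/scg.json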
    

Our pipeline supports both Conda and Singularity.

## For Conda users

  1. Install Conda

  2. Install Conda dependencies.

      $ bash conda/uninstall_dependencies.sh  # to remove any existing pipeline env
      $ bash conda/install_dependencies.sh
    
  3. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR356KRQ.

      $ source activate encode-atac-seq-pipeline # IMPORTANT!
      $ INPUT=examples/scg/ENCSR356KRQ_subsampled_scg.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/scg.json
    
  4. It will take about an hour. You will be able to find all outputs under cromwell-executions/atac/[RANDOM_HASH_STRING]/ (see the sketch after this list for locating the most recent run). See the output directory structure documentation for details.

  5. See the full specification for the input JSON file.
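
Cromwell names each run's directory with a random hash, so finding the newest run by hand can be tedious. The following is a small sketch; the path layout is just Cromwell's default cromwell-executions/ tree used above.

      $ # Sketch: show the most recently modified workflow directory under cromwell-executions/atac/.
      $ ls -dt cromwell-executions/atac/*/ | head -n 1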

## For Singularity users

  1. Pull a Singularity container for the pipeline. This will pull the pipeline's Docker container first and build a Singularity image under ~/.singularity.

      $ SINGULARITY_PULLFOLDER=~/.singularity singularity pull docker://quay.io/encode-dcc/atac-seq-pipeline:v1.1
    
  2. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR356KRQ.

      $ INPUT=examples/scg/ENCSR356KRQ_subsampled_scg.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/scg.json
    
  3. It will take about an hour. You will be able to find all outputs under cromwell-executions/atac/[RANDOM_HASH_STRING]/. See the output directory structure documentation for details.

  4. See the full specification for the input JSON file.

  5. IF YOU WANT TO RUN PIPELINES WITH YOUR OWN INPUT DATA/GENOME DATABASE, PLEASE ADD THEIR DIRECTORIES TO workflow_opts/scg.json. For example, if your input FASTQs are under /your/input/fastqs/ and your genome database is installed under /your/genome/database/, add /your/ to --bind in singularity_command_options. You can define multiple directories there, separated by commas. To verify a bound directory, see the sketch after this list.

      {
          "default_runtime_attributes" : {
              "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.simg",
              "singularity_command_options" : "--bind /ifs/scratch,/srv/gsfs0,/your/,YOUR_OWN_DATA_DIR1,YOUR_OWN_DATA_DIR2,..."
          }
      }
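
To check that a bound host directory is actually visible inside the container, a quick test like the one below can help. This is a sketch: the image path is whatever singularity pull produced in step 1, and /your/input/fastqs/ is a placeholder.

      $ # Sketch: list a bound host directory from inside the container to confirm the --bind path works.
      $ singularity exec --bind /your/ ~/.singularity/atac-seq-pipeline-v1.1.simg ls /your/input/fastqs/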
    

## Running multiple pipelines with Cromwell server mode

  1. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive, and note that all running pipelines will be killed when the walltime expires. Start a Cromwell server with the following commands.

      $ srun -n 2 --mem 5G -t 3-0 --qos normal --account [YOUR_SCG_ACCOUNT] --pty /bin/bash -i -l    # 2 CPU, 5 GB RAM and 3 day walltime
      $ hostname -f    # to get [CROMWELL_SVR_IP]
    

    For Conda users,

      $ source activate encode-atac-seq-pipeline 
      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar server
    

    For Singularity users,

      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar server
    
  2. You can increase the maximum number of concurrent jobs by modifying backend.providers.slurm.concurrent-job-limit or backend.providers.slurm_singularity.concurrent-job-limit in backends/backend.conf. This limit is NOT PER SAMPLE; it applies to all sub-tasks of all submitted samples.

  3. On a login node, submit workflows to the Cromwell server. Each submission returns a [WORKFLOW_ID]. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later; see the sketch after this list for capturing the ID automatically.

      $ INPUT=YOUR_INPUT.json
      $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
        -F [email protected] \
        -F workflowInputs=@${INPUT} \
        -F workflowOptions=@workflow_opts/scg.json
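
If you would rather capture the returned workflow ID in a variable than copy it from curl's output, a sketch like the following works. It assumes jq is available and relies on the submission response being JSON with an "id" field, which is how the Cromwell REST API responds.

      $ # Sketch: submit a workflow and store the returned ID in a shell variable (jq assumed to be installed).
      $ WORKFLOW_ID=$(curl -s -X POST "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
          -F workflowSource=@atac.wdl \
          -F workflowInputs=@${INPUT} \
          -F workflowOptions=@workflow_opts/scg.json | jq -r .id)
      $ echo ${WORKFLOW_ID}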
    

To monitor pipelines, see the Cromwell server REST API documentation for more details; squeue will not give you enough information for monitoring jobs per sample. For example, to check the status of a submitted workflow:

      $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
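
To also get the locations of a finished workflow's final output files, the server's outputs endpoint returns them as JSON. The endpoint below is part of the same Cromwell REST API; [WORKFLOW_ID] is the ID returned at submission.

      $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/outputs"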