# Tutorial for Stanford SCG cluster

All test samples and genome data are shared on the Stanford SCG cluster, which runs SLURM. You don't have to download any data to test our pipeline there.

  1. SSH to SCG's login node.

      $ ssh login.scg.stanford.edu
    
  2. Clone this pipeline's repository and move into its directory.

      $ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
      $ cd atac-seq-pipeline
    
  3. Download Cromwell.

      $ wget https://github.com/broadinstitute/cromwell/releases/download/34/cromwell-34.jar
      $ chmod +rx cromwell-34.jar
    
  4. Set your SLURM account in workflow_opts/scg.json; the other runtime attributes in that file are for Singularity and can be ignored here. For a command-line way to set the account, see the sketch after this list.

      {
          "default_runtime_attributes" : {
              "slurm_account" : "YOUR_SLURM_ACCOUNT"
          }
      }
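
If you prefer to set the account from the command line rather than in an editor, a one-liner like the following works. This is only a convenience sketch: it assumes jq is available on the login node, and YOUR_SLURM_ACCOUNT is a placeholder.

      $ # Sketch: write your SLURM account into workflow_opts/scg.json (jq is assumed to be installed).
      $ jq '.default_runtime_attributes.slurm_account = "YOUR_SLURM_ACCOUNT"' workflow_opts/scg.json > scg.json.tmp
      $ mv scg.json.tmp workflow_opts/scg.json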
    

Our pipeline supports both Conda and Singularity.

## For Conda users

  1. Install Conda

  2. Install Conda dependencies.

      $ bash conda/uninstall_dependencies.sh  # to remove any existing pipeline env
      $ bash conda/install_dependencies.sh
    
  3. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR356KRQ.

      $ source activate encode-atac-seq-pipeline # IMPORTANT!
      $ INPUT=examples/scg/ENCSR356KRQ_subsampled_scg.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/scg.json
    
  4. It will take about an hour. You will be able to find all outputs under cromwell-executions/atac/[RANDOM_HASH_STRING]/ (see the sketch after this list for locating the most recent run). See the output directory structure documentation for details.

  5. See the full specification for the input JSON file.
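
Cromwell names each run's directory with a random hash, so finding the newest run by hand can be tedious. The following is a small sketch; the path layout is just Cromwell's default cromwell-executions/ tree used above.

      $ # Sketch: show the most recently modified workflow directory under cromwell-executions/atac/.
      $ ls -dt cromwell-executions/atac/*/ | head -n 1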

## For Singularity users

  1. Pull a Singularity container for the pipeline. This will pull the pipeline's Docker container first and build a Singularity image under ~/.singularity.

      $ SINGULARITY_PULLFOLDER=~/.singularity singularity pull docker://quay.io/encode-dcc/atac-seq-pipeline:v1.1
    
  2. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR356KRQ.

      $ INPUT=examples/scg/ENCSR356KRQ_subsampled_scg.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/scg.json
    
  3. It will take about an hour. You will be able to find all outputs under cromwell-executions/atac/[RANDOM_HASH_STRING]/. See the output directory structure documentation for details.

  4. See the full specification for the input JSON file.

  5. IF YOU WANT TO RUN PIPELINES WITH YOUR OWN INPUT DATA/GENOME DATABASE, PLEASE ADD THEIR DIRECTORIES TO workflow_opts/scg.json. For example, if your input FASTQs are under /your/input/fastqs/ and your genome database is installed under /your/genome/database/, add /your/ to --bind in singularity_command_options. You can define multiple directories there, separated by commas. To verify a bound directory, see the sketch after this list.

      {
          "default_runtime_attributes" : {
              "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.simg",
              "singularity_command_options" : "--bind /ifs/scratch,/srv/gsfs0,/your/,YOUR_OWN_DATA_DIR1,YOUR_OWN_DATA_DIR2,..."
          }
      }
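
To check that a bound host directory is actually visible inside the container, a quick test like the one below can help. This is a sketch: the image path is whatever singularity pull produced in step 1, and /your/input/fastqs/ is a placeholder.

      $ # Sketch: list a bound host directory from inside the container to confirm the --bind path works.
      $ singularity exec --bind /your/ ~/.singularity/atac-seq-pipeline-v1.1.simg ls /your/input/fastqs/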
    

## Running multiple pipelines with Cromwell server mode

  1. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive, and note that all running pipelines will be killed when the walltime expires. Start a Cromwell server with the following commands.

      $ srun -n 2 --mem 5G -t 3-0 --qos normal --account [YOUR_SCG_ACCOUNT] --pty /bin/bash -i -l    # 2 CPU, 5 GB RAM and 3 day walltime
      $ hostname -f    # to get [CROMWELL_SVR_IP]
    

    For Conda users,

      $ source activate encode-atac-seq-pipeline 
      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar server
    

    For Singularity users,

      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar server
    
  2. You can increase the maximum number of concurrent jobs by modifying backend.providers.slurm.concurrent-job-limit or backend.providers.slurm_singularity.concurrent-job-limit in backends/backend.conf. This limit is NOT PER SAMPLE; it applies to all sub-tasks of all submitted samples.

  3. On a login node, submit workflows to the Cromwell server. Each submission returns a [WORKFLOW_ID]. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later; see the sketch after this list for capturing the ID automatically.

      $ INPUT=YOUR_INPUT.json
      $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
        -F [email protected] \
        -F workflowInputs=@${INPUT} \
        -F workflowOptions=@workflow_opts/scg.json
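
If you would rather capture the returned workflow ID in a variable than copy it from curl's output, a sketch like the following works. It assumes jq is available and relies on the submission response being JSON with an "id" field, which is how the Cromwell REST API responds.

      $ # Sketch: submit a workflow and store the returned ID in a shell variable (jq assumed to be installed).
      $ WORKFLOW_ID=$(curl -s -X POST "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
          -F workflowSource=@atac.wdl \
          -F workflowInputs=@${INPUT} \
          -F workflowOptions=@workflow_opts/scg.json | jq -r .id)
      $ echo ${WORKFLOW_ID}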
    

To monitor pipelines, see the Cromwell server REST API documentation for more details; squeue will not give you enough information for monitoring jobs per sample. For example, to check the status of a submitted workflow:

      $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
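
To also get the locations of a finished workflow's final output files, the server's outputs endpoint returns them as JSON. The endpoint below is part of the same Cromwell REST API; [WORKFLOW_ID] is the ID returned at submission.

      $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/outputs"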