Input JSON

An input JSON file includes all input parameters and metadata for running pipelines. Items 1) and 2) are mandatory; items 3) and 4) are optional, and the pipeline will use default values for them if they are not defined. A minimal example follows the list below.

  • Mandatory
  1. Reference genome.
  2. Input data file paths/URIs.
  • Optional
  3. Pipeline parameters.
  4. Resource settings for jobs.
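
For instance, a minimal input JSON defining only the mandatory items (plus read endedness) could look like the following sketch; the genome TSV and FASTQ paths are placeholders:

{
    "chip.genome_tsv" : "/path_to_genome_data/hg38/hg38.tsv",
    "chip.paired_end" : false,
    "chip.fastqs_rep1_R1" : [ "rep1.fastq.gz" ]
}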

Templates

We provide two template JSON files, one for single-end and one for paired-end samples. We recommend using one of these templates instead of the input JSON from the tutorial section, since they include all pipeline parameters with their default values defined.

Let us take a close look at the following template JSON. Comments are not allowed in JSON, but we have added some here to explain each parameter.

{
    ////////// 1) Reference genome //////////
    // Stanford servers: [GENOME]=hg38,hg19,mm10,mm9
    //   Sherlock: /home/groups/cherry/encode/pipeline_genome_data/[GENOME]_sherlock.tsv
    //   SCG4: /reference/ENCODE/pipeline_genome_data/[GENOME]_scg.tsv

    // Cloud platforms (Google Cloud, DNAnexus): [GENOME]=hg38,hg19,mm10,mm9
    //   Google Cloud: gs://encode-pipeline-genome-data/[GENOME]_google.tsv
    //   DNAnexus: dx://project-BKpvFg00VBPV975PgJ6Q03v6:pipeline-genome-data/[GENOME]_dx.tsv
    //   DNAnexus(Azure): dx://project-F6K911Q9xyfgJ36JFzv03Z5J:pipeline-genome-data/[GENOME]_dx_azure.tsv

    // On other computers, download or build a reference genome database and pick a TSV from [DEST_DIR].
    //   Downloader: ./genome/download_genome_data.sh [GENOME] [DEST_DIR]
    //   Builder (Conda required): ./conda/build_genome_data.sh [GENOME] [DEST_DIR]
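    //   e.g. ./genome/download_genome_data.sh hg38 /your/genome/data (the destination directory here is just an example)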

    "chip.genome_tsv" : "/path_to_genome_data/hg38/hg38.tsv",

    ////////// 2) Input data file paths/URIs //////////

    // Read endedness
    "chip.paired_end" : true,

    // If you start from FASTQs, define these; otherwise remove them from this file.
    // You can define up to 6 replicates.
    // FASTQs in an array will be merged after trimming adapters.
    // For example, 
    // "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz" and "rep1_R1_L3.fastq.gz" will be merged together.
    "chip.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
    "chip.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
    "chip.fastqs_rep2_R1" : [ "rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz" ],
    "chip.fastqs_rep2_R2" : [ "rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz" ],

    // Define these if you have control FASTQs; otherwise remove them from this file.
    "chip.ctl_fastqs_rep1_R1" : [ "ctl1_R1.fastq.gz" ],
    "chip.ctl_fastqs_rep1_R2" : [ "ctl1_R2.fastq.gz" ],
    "chip.ctl_fastqs_rep2_R1" : [ "ctl2_R1.fastq.gz" ],
    "chip.ctl_fastqs_rep2_R2" : [ "ctl2_R2.fastq.gz" ],

    // If you start from BAMs, define these; otherwise remove them from this file.
    // You can define up to 6 replicates. The following example array has two replicates.
    "chip.bams" : [
        "raw_rep1.bam",
        "raw_rep2.bam"
    ],
    // Define these if you have control BAMs; otherwise remove them from this file.
    "chip.ctl_bams" : [
        "raw_ctl1.bam",
        "raw_ctl2.bam"
    ],

    // If you start from filtered/deduped BAMs, define these; otherwise remove them from this file.
    // You can define up to 6 replicates. The following example array has two replicates.
    "chip.nodup_bams" : [
        "nodup_rep1.bam",
        "nodup_rep2.bam"
    ],
    // Define these if you have control filtered/deduped BAMs; otherwise remove them from this file.
    "chip.ctl_nodup_bams" : [
        "nodup_ctl1.bam",
        "nodup_ctl2.bam"
    ],

    // If you start from TAG-ALIGNs, define these; otherwise remove them from this file.
    // You can define up to 6 replicates. The following example array has two replicates.
    "chip.tas" : [
        "rep1.tagAlign.gz",
        "rep2.tagAlign.gz"
    ],
    // Define these if you have control TAG-ALIGNs; otherwise remove them from this file.
    "chip.ctl_tas" : [
        "ctl1.tagAlign.gz",
        "ctl2.tagAlign.gz"
    ],
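    // NOTE: define only one of the above starting points per run
    // (FASTQs, BAMs, filtered/deduped BAMs or TAG-ALIGNs) and remove the others.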

    ////////// 3) Pipeline parameters //////////

    // Pipeline title and description
    "chip.title" : "Example (single-ended)",
    "chip.description" : "This is an template input JSON for single-ended sample.",

    // Pipeline type (tf or histone).
    // default peak_caller: spp for tf, macs2 for histone
    "chip.pipeline_type" : "tf",
    // You can also manually specify a peak_caller
    "chip.peak_caller" : "spp",

    // If true, the pipeline will not proceed to post-alignment steps (peak-calling, ...).
    // You will get a QC report for alignment only.
    "chip.align_only" : false,
    "chip.true_rep_only" : false,

    // Disable deeptools fingerprint (JS distance)
    "chip.disable_fingerprint" : false,

    // Enable count signal track generation
    "chip.enable_count_signal_track" : false,

    // Trim R1 of paired-end FASTQs for cross-correlation analysis only.
    // Trimmed FASTQs will not be used for any other analyses.
    "chip.xcor_pe_trim_bp" : 50,

    // Use filtered PE BAM/TAG-ALIGN for cross-correlation analysis, ignoring the above trimmed R1 FASTQ.
    "chip.use_filt_pe_ta_for_xcor" : false,

    // Choose a dup marker between picard and sambamba
    // picard is recommended; use sambamba only when picard fails.
    "chip.dup_marker" : "picard",

    // Threshold for mapped reads quality (samtools view -q)
    "chip.mapq_thresh" : 30,

    // Skip dup removal in a BAM filtering stage.
    "chip.no_dup_removal" : false,

    // Name of the mito chromosome. THIS IS NOT A REG-EX! You can define only one chromosome name for mito.
    "chip.mito_chr_name" : "chrM",

    // Regular expression to filter out reads with a given chromosome name (1st column of BED/TAG-ALIGN).
    // Any read whose chromosome name matches this reg-ex pattern will be removed from outputs.
    // If you have changed the above parameter "chip.mito_chr_name" and still want to filter out mito reads,
    // then make sure that "chip.mito_chr_name" and "chip.regex_filter_reads" are the same.
    "chip.regex_filter_reads" : "chrM",

    // Subsample reads (0: no subsampling)
    // Subsampled reads will be used for all downstream analyses, including peak-calling.
    "chip.subsample_reads" : 0,
    "chip.ctl_subsample_reads" : 0,

    // Cross-correlation analysis
    // Subsample reads for cross-corr. analysis only (0: no subsampling)
    // Subsampled reads will be used for cross-corr. analysis only
    "chip.xcor_subsample_reads" : 15000000,

    // Keep irregular chromosome names
    // Use this for custom genomes without canonical chromosome names (chr1, chrX, ...)
    "chip.keep_irregular_chr_in_bfilt_peak" : false,

    // Choosing an appropriate control for each replicate.
    // If true, always use a pooled control to compare with each replicate.
    // If a single control is given, use it.
    "chip.always_use_pooled_ctl" : false,
    // If the ratio of sequencing depth between controls is higher than this,
    // then always use a pooled control for all replicates.
    "chip.ctl_depth_ratio" : 1.2,

    // Cap number of peaks called from a peak-caller (MACS2)
    "chip.macs2_cap_num_peak" : 500000,
    // P-value threshold for MACS2 (macs2 callpeak -p)
    "chip.pval_thresh" : 0.01,

    // IDR (irreproducible discovery rate)
    // Threshold for IDR
    "chip.idr_thresh" : 0.05,

    // Cap number of peaks called from a peak-caller (SPP)
    "chip.spp_cap_num_peak" : 300000,

    ////////// 4) Resource settings //////////

    // Resources defined here are PER REPLICATE.
    // Therefore, the total number of cores will be MAX("chip.bwa_cpu" x [NUMBER_OF_REPLICATES], "chip.spp_cpu" x 2 x [NUMBER_OF_REPLICATES]),
    // because bwa and spp are the bottlenecking tasks of the pipeline.
    // Use this total number of cores if you manually qsub or sbatch your job (using the local mode of our pipeline).
    // "disks" is used for Google Cloud and DNAnexus only.

    "chip.bwa_cpu" : 4,
    "chip.bwa_mem_mb" : 20000,
    "chip.bwa_time_hr" : 48,
    "chip.bwa_disks" : "local-disk 100 HDD",

    "chip.filter_cpu" : 2,
    "chip.filter_mem_mb" : 20000,
    "chip.filter_time_hr" : 24,
    "chip.filter_disks" : "local-disk 100 HDD",

    "chip.bam2ta_cpu" : 2,
    "chip.bam2ta_mem_mb" : 10000,
    "chip.bam2ta_time_hr" : 6,
    "chip.bam2ta_disks" : "local-disk 100 HDD",

    "chip.spr_mem_mb" : 16000,

    "chip.fingerprint_cpu" : 2,
    "chip.fingerprint_mem_mb" : 12000,
    "chip.fingerprint_time_hr" : 6,
    "chip.fingerprint_disks" : "local-disk 100 HDD",

    "chip.xcor_cpu" : 2,
    "chip.xcor_mem_mb" : 16000,
    "chip.xcor_time_hr" : 24,
    "chip.xcor_disks" : "local-disk 100 HDD",

    "chip.macs2_mem_mb" : 16000,
    "chip.macs2_time_hr" : 24,
    "chip.macs2_disks" : "local-disk 100 HDD",

    "chip.spp_cpu" : 2,
    "chip.spp_mem_mb" : 16000,
    "chip.spp_time_hr" : 72,
    "chip.spp_disks" : "local-disk 100 HDD",
}
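
For example, if you start from filtered/deduped BAMs and keep all other defaults, the whole input JSON can shrink to a sketch like this (file paths are placeholders):

{
    "chip.genome_tsv" : "/path_to_genome_data/hg38/hg38.tsv",
    "chip.paired_end" : true,
    "chip.pipeline_type" : "tf",
    "chip.nodup_bams" : [ "nodup_rep1.bam", "nodup_rep2.bam" ],
    "chip.ctl_nodup_bams" : [ "nodup_ctl1.bam", "nodup_ctl2.bam" ]
}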

Reference genome

We currently support 4 genomes. You can also build a genome database for your own genome.

| genome | source | built from |
|--------|--------|------------|
| hg38 | ENCODE | GRCh38_no_alt_analysis_set_GCA_000001405 |
| mm10 | ENCODE | mm10_no_alt_analysis_set_ENCODE |
| hg19 | UCSC | GRCh37/hg19 |
| mm9 | UCSC | mm9, NCBI Build 37 |

Choose one TSV file for "chip.genome_tsv" in your input JSON. [GENOME] should be hg38, mm10, hg19 or mm9.

| platform | path/URI |
|----------|----------|
| Google Cloud Platform | gs://encode-pipeline-genome-data/[GENOME]_google.tsv |
| DNAnexus (CLI) | dx://project-BKpvFg00VBPV975PgJ6Q03v6:pipeline-genome-data/[GENOME]_dx.tsv |
| DNAnexus (CLI, Azure) | dx://project-F6K911Q9xyfgJ36JFzv03Z5J:pipeline-genome-data/[GENOME]_dx_azure.tsv |
| DNAnexus (Web) | Choose [GENOME]_dx.tsv from here |
| DNAnexus (Web, Azure) | Choose [GENOME]_dx_azure.tsv from here |
| Stanford Sherlock | /home/groups/cherry/encode/pipeline_genome_data/[GENOME]_sherlock.tsv |
| Stanford SCG | /reference/ENCODE/pipeline_genome_data/[GENOME]_scg.tsv |
| Local/SLURM/SGE | You need to build a genome database. |
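
For example, on Google Cloud Platform with hg38, the corresponding entry in your input JSON would be:

{
    "chip.genome_tsv" : "gs://encode-pipeline-genome-data/hg38_google.tsv"
}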