Chipseq updates #111 (Open)

Wants to merge 7 commits into base: main
104 changes: 92 additions & 12 deletions tools/samtools.wdl
@@ -431,7 +431,7 @@ task collate {

task bam_to_fastq {
meta {
description: "This WDL task runs `samtools fastq` on the input BAM file. Splits the BAM into FASTQ files. Assumes either a name sorted or collated BAM. For splitting a position sorted BAM see `collate_to_fastq`."
description: "This WDL task runs `samtools fastq` on the input BAM file. Converts the BAM into FASTQ files. If paired-end = false, then all reads in the BAM will be output to a single FASTQ file. Use filtering arguments to remove any unwanted reads. Assumes either a name sorted or collated BAM. For splitting a position sorted BAM see `collate_to_fastq`."
outputs: {
read_one_fastq_gz: "Gzipped FASTQ file with 1st reads in pair"
read_two_fastq_gz: "Gzipped FASTQ file with 2nd reads in pair"
@@ -446,9 +446,10 @@ task bam_to_fastq {
f: "Only output alignments with all bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x` (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0` (i.e. /^0[0-7]+/)."
F: "Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x` (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0` (i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and supplementary alignments."
G: "Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x` (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0` (i.e. /^0[0-7]+/)."
paired_end: "Is the data paired-end?"
append_read_number: "Append /1 and /2 suffixes to read names"
interleaved: "Create an interleaved FASTQ file from paired-end data?"
output_singletons: "Output singleton reads as their own FASTQ?"
paired_end: "Is the data paired-end?"
ncpu: "Number of cores to allocate for task"
memory_gb: "RAM to allocate for task, specified in GB"
modify_disk_size_gb: "Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB."
@@ -462,9 +463,10 @@ task bam_to_fastq {
String f = "0"
String F = "0x900"
String G = "0"
Boolean paired_end = true
Boolean append_read_number = true
Boolean interleaved = false
Boolean output_singletons = false
Boolean paired_end = true
Comment on lines -465 to +469
Member:
Just curious, did you reorder these to be alphabetical? The style-guide leaves it up to the author how to order parameters of the same type. My process for ordering things is typically:

  1. follow the order of the underlying tool's documentation
  2. what feels most important for a user to see should be towards the top
    • obviously incredibly subjective here. But with this I'd say paired_end should be first in this block.

I'm fine with this being alphabetical, but I don't think any other inputs are. This will likely be the odd one out if left alphabetical, but we intentionally punted the question of "order once type order is satisfied", so I can't complain about anything you do here.

Member Author:
I did choose alphabetical. That said, in this case I think it makes sense, as paired_end is an input for the task, whereas the other arguments more closely correspond to inputs of the underlying tool.

Member:
Ok. The style-guide leaves it up to individual authors, so if you want to write tasks using alphabetic sort at this level, that's fine with me. But I'm going to continue using my (highly subjective) method of sorting inputs, so there will be some degree of inconsistency. I think that's fine. The fine-grained sorting here is fairly inconsequential.

Boolean use_all_cores = false
Int ncpu = 1
Int memory_gb = 4
@@ -488,22 +490,35 @@ task bam_to_fastq {
-f ~{f} \
-F ~{F} \
-G ~{G} \
-1 ~{if interleaved
then prefix + ".fastq.gz"
else prefix + "_R1.fastq.gz"
~{if append_read_number
then "-N"
else "-n"
} \
-1 ~{
if paired_end then (
if interleaved then prefix + ".fastq.gz" else prefix + "_R1.fastq.gz"
)
else prefix + ".fastq.gz"
} \
-2 ~{
if paired_end then (
if interleaved then prefix + ".fastq.gz" else prefix + "_R2.fastq.gz"
)
else prefix + ".fastq.gz"
} \
~{
if paired_end then (
if output_singletons
then "-s " + prefix+".singleton.fastq.gz"
else "-s junk.singleton.fastq.gz"
)
else ""
} \
-s ~{
if output_singletons
then prefix+".singleton.fastq.gz"
else "/dev/null"
-0 ~{
if paired_end
then "junk.unknown_bit_setting.fastq.gz"
else prefix + ".fastq.gz"
} \
-0 /dev/null \
~{bam}
>>>
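For orientation, a minimal sketch of calling this task in each mode (the samtools import alias and the collated_pe_bam / collated_se_bam declarations are assumptions for illustration, not part of this diff):

    call samtools.bam_to_fastq as pe_fastqs { input:
        bam=collated_pe_bam,
        paired_end=true    # default; writes <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz
    }
    call samtools.bam_to_fastq as se_fastqs { input:
        bam=collated_se_bam,
        paired_end=false   # all reads end up in a single <prefix>.fastq.gz
    }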

@@ -637,3 +652,68 @@ task collate_to_fastq {
maxRetries: max_retries
}
}

task fixmate {
meta {
description: "This WDL task runs Samtools fixmate on the input BAM file. This fills in mate coordinates and insert size fields."
}

parameter_meta {
bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated."
prefix: "Prefix for the output file. The extension specified with the `extension` parameter will be added."
extension: {
description: "File format extension to use for output file.",
choices: [
".sam",
".bam",
".cram"
]
}
add_cigar: "Add template cigar ct tag"
add_mate_score: "Add mate score tags. These are used by markdup to select the best reads to keep."
disable_proper_pair_check: "Disable proper pair check [ensure one forward and one reverse read in each pair]"
remove_unaligned_and_secondary: "Remove unmapped and secondary reads"
ncpu: "Number of cores to allocate for task"
memory_gb: "RAM to allocate for task, specified in GB"
modify_disk_size_gb: "Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB."
max_retries: "Number of times to retry in case of failure"
}

input {
File bam
String prefix = basename(bam, ".bam") + ".fixmate"
String extension = ".bam"
Boolean add_cigar = true
Boolean add_mate_score = true
Boolean disable_proper_pair_check = false
Boolean remove_unaligned_and_secondary = false
Int ncpu = 1
Int memory_gb = 4
Int modify_disk_size_gb = 0
Int max_retries = 1
}

Float bam_size = size(bam, "GiB")
Int disk_size_gb = ceil(bam_size) + 10 + modify_disk_size_gb

command <<<
samtools fixmate \
~{if remove_unaligned_and_secondary then "-r" else ""} \
~{if disable_proper_pair_check then "-p" else ""} \
~{if add_cigar then "-c" else ""} \
~{if add_mate_score then "-m" else ""} \
~{bam} ~{prefix}~{extension}
>>>

output {
File fixmate_bam = "~{prefix}~{extension}"
}

runtime {
cpu: ncpu
memory: "~{memory_gb} GB"
disk: "~{disk_size_gb} GB"
docker: 'quay.io/biocontainers/samtools:1.16.1--h6899075_1'
maxRetries: max_retries
}
}
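A minimal usage sketch for the new task (the samtools import alias and the name_collated_bam declaration are assumptions for illustration):

    call samtools.fixmate { input:
        bam=name_collated_bam,     # must be name-sorted or name-collated
        prefix="sample.fixmate",   # output file will be sample.fixmate.bam
        add_mate_score=true        # default; mate score tags are what markdup uses to pick the best reads
    }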
5 changes: 4 additions & 1 deletion tools/util.wdl
@@ -322,7 +322,10 @@ task add_to_bam_header {
String outfile_name = prefix + ".bam"

command <<<
samtools view -H ~{bam} > header.sam
set -euo pipefail
# Remove trailing tab characters that some tools (bowtie) add
# to header records. Also skip adding a @PG record for this operation.
samtools view --no-PG -H ~{bam} | sed 's/\t$//' > header.sam
echo "~{additional_header}" >> header.sam
samtools reheader -P header.sam ~{bam} > ~{outfile_name}
>>>
43 changes: 28 additions & 15 deletions workflows/chipseq/chipseq-standard.wdl
@@ -36,14 +36,15 @@ import "../../tools/picard.wdl"
import "../../tools/samtools.wdl"
import "../../tools/util.wdl"
import "../general/bam-to-fastqs.wdl" as b2fq
import "https://raw.githubusercontent.com/stjude/seaseq/2.3/workflows/workflows/mapping.wdl" as seaseq_map
import "https://raw.githubusercontent.com/stjude/seaseq/3.0/workflows/tasks/samtools.wdl" as seaseq_samtools
import "https://raw.githubusercontent.com/stjude/seaseq/3.0/workflows/tasks/seaseq_util.wdl" as seaseq_util
import "https://raw.githubusercontent.com/stjude/seaseq/3.1/workflows/workflows/mapping.wdl" as seaseq_map
import "https://raw.githubusercontent.com/stjude/seaseq/3.1/workflows/tasks/samtools.wdl" as seaseq_samtools
import "https://raw.githubusercontent.com/stjude/seaseq/3.1/workflows/tasks/seaseq_util.wdl" as seaseq_util

workflow chipseq_standard {
parameter_meta {
bam: "Input BAM format file to realign with bowtie"
bowtie_indexes: "Database of v1 reference files for the bowtie aligner. Can be generated with https://github.com/stjude/seaseq/blob/master/workflows/tasks/bowtie.wdl. [*.ebwt]"
paired_end: "Is the data paired-end (true) or single-end (false)?"
excludelist: "Optional list of regions that will be excluded after reference alignment"
prefix: "Prefix for output files"
validate_input: "Run Picard ValidateSamFile on the input BAM"
Expand All @@ -55,6 +56,7 @@ workflow chipseq_standard {
input {
File bam
Array[File] bowtie_indexes
Boolean paired_end = false
File? excludelist
String prefix = basename(bam, ".bam")
Boolean validate_input = true
@@ -89,7 +91,7 @@

call b2fq.bam_to_fastqs { input:
bam=selected_bam,
paired_end=false,
paired_end=paired_end,
use_all_cores=use_all_cores,
max_retries=max_retries
}
@@ -105,29 +107,31 @@
max_retries=max_retries
}

scatter (pair in zip(bam_to_fastqs.read1s, read_groups)){
scatter (tuple in zip(zip(bam_to_fastqs.read1s, bam_to_fastqs.read2s), read_groups)){
Member:
Are you positive this works for all combinations of inputs? I thought WDL enforced "every array in a zip must be of equal length". I think this line would break under certain circumstances. Or maybe my understanding of zip is wrong.

Member Author:
I need to run some additional test data through, but for both single-end and paired-end data so far, that argument works.

Member:
Can you point me to an input JSON you're using for both PE and SE? I want to walk myself through some concrete examples, because I'm confused how this zip here is working. I'm sure once I can see the "filled in" logic it'll be clear.

Member Author:
Sure. I just moved these from my home directory. I did a sed to update the paths. Let me know if I missed one.

/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_input.json
/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_paired_input.json
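For anyone following along, a self-contained sketch of the zip() behavior under discussion (hypothetical values; assumes WDL 1.1, where zip() requires equal-length arrays and None placeholders are permitted in an Array[File?]):

    version 1.1

    workflow zip_demo {
        # Two shards of single-end data: the R2 slots are None, but the array still has
        # the same length as read1s, so both zip() calls below are legal.
        Array[File] read1s = ["shard0_R1.fastq.gz", "shard1_R1.fastq.gz"]
        Array[File?] read2s = [None, None]
        Array[String] read_groups = ["@RG\tID:shard0", "@RG\tID:shard1"]

        scatter (tuple in zip(zip(read1s, read2s), read_groups)) {
            File r1 = tuple.left.left      # R1 of the nested pair
            File? r2 = tuple.left.right    # R2, or None for single-end data
            String rg = tuple.right
        }
    }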

call seaseq_util.basicfastqstats as basic_stats { input:
fastqfile=pair.left
fastqfile=tuple.left.left
}
call seaseq_map.mapping as bowtie_single_end_mapping { input:
fastqfile=pair.left,
call seaseq_map.mapping as bowtie_mapping { input:
fastqfile=tuple.left.left, # the FASTQ pair is the left element of the outer pair; within that nested pair, R1 is the left and R2 is the right
fastqfile_R2=tuple.left.right,
index_files=bowtie_indexes,
metricsfile=basic_stats.metrics_out,
blacklist=excludelist
blacklist=excludelist,
paired_end=paired_end
}
File chosen_bam = select_first(
[
bowtie_single_end_mapping.bklist_bam,
bowtie_single_end_mapping.mkdup_bam,
bowtie_single_end_mapping.sorted_bam
bowtie_mapping.bklist_bam,
bowtie_mapping.mkdup_bam,
bowtie_mapping.sorted_bam
]
)
call util.add_to_bam_header { input:
bam=chosen_bam,
additional_header=pair.right,
additional_header=tuple.right,
max_retries=max_retries
}
String rg_id_field = sub(sub(pair.right, ".*ID:", "ID:"), "\t.*", "")
String rg_id_field = sub(sub(tuple.right, ".*ID:", "ID:"), "\t.*", "")
String rg_id = sub(rg_id_field, "ID:", "")
call samtools.addreplacerg as single_end { input:
Member Author:
Rename single_end to something else

bam=add_to_bam_header.reheadered_bam,
@@ -159,7 +163,16 @@
use_all_cores=use_all_cores,
max_retries=max_retries
}
call picard.validate_bam { input: bam=markdup.mkdupbam, max_retries=max_retries }
call picard.validate_bam { input:
bam=markdup.mkdupbam,
ignore_list=["MISSING_PLATFORM_VALUE",
"INVALID_PLATFORM_VALUE",
"INVALID_MAPPING_QUALITY",
"MATES_ARE_SAME_END",
"MISMATCH_FLAG_MATE_NEG_STRAND",
"MISMATCH_MATE_ALIGNMENT_START"],
Comment on lines +170 to +173
Member:
These highlighted 4 feel like things that shouldn't be ignored. Are BAMs really coming out of SEAseq with all these problems? Should that be addressed instead of ignored here?

Member Author:
It's an issue with bowtie. It looks like bowtie is emitting multiple alignments that have equal alignment scores. Instead of picking one to be primary, it just outputs all of them without marking any as secondary. Then Picard fails to match the read records properly. It should be able to use the mate start position information to find the proper mate, but it doesn't. So you get these errors.

Member:
Interesting. Is this a known and logged bug in bowtie? Is the SEAseq team aware this is happening? And is this why you started implementing samtools fixmate? Would that fix these errors? (sorry for the bombardment of questions)

I'd much rather fix these errors than simply ignore them. If I'm understanding correctly, this update (and possibly the current version) outputs BAMs which don't follow the SPEC. I'm not a fan of passing the buck off to bowtie and saying "not a problem with our code". These BAMs will likely fail downstream. I don't think I'm comfortable just glossing over these errors.

My two cents: we either need an update/rollback on bowtie so this isn't a problem in the first place, or we need a tool like samtools fixmate to address the bowtie problems before we can call these BAMs "harmonized".

Member Author:
I believe the solution is to use bowtie2. I don't think there is any support or development of bowtie at this point. I started implementing samtools fixmate for a different issue, so I didn't try it here.

max_retries=max_retries
}

call md5sum.compute_checksum { input:
file=markdup.mkdupbam,
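Not part of this changeset, but to make the suggestion in the thread above concrete: if samtools fixmate does repair the mate records that bowtie leaves inconsistent, it could be wired in on a name-collated BAM ahead of duplicate marking and validation instead of growing ignore_list. The wiring point and the collate output name are hypothetical:

    call samtools.collate as collate_for_fixmate { input:
        bam=chosen_bam,
        max_retries=max_retries
    }
    call samtools.fixmate { input:
        bam=collate_for_fixmate.collated_bam,   # fixmate requires name order; output name is hypothetical
        max_retries=max_retries
    }
    # ...coordinate-sort the fixmate output, then run markdup and validate_bam on the repaired BAM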
43 changes: 32 additions & 11 deletions workflows/general/bam-to-fastqs.wdl
@@ -54,18 +54,39 @@ workflow bam_to_fastqs {

call samtools.quickcheck { input: bam=bam, max_retries=max_retries }
call samtools.split { input: bam=bam, use_all_cores=use_all_cores, max_retries=max_retries }
scatter (split_bam in split.split_bams) {
call samtools.collate_to_fastq as bam_to_fastq { input:
bam=split_bam,
paired_end=paired_end,
interleaved=false, # matches default but prevents user from overriding
use_all_cores=use_all_cores,
max_retries=max_retries

if (paired_end){
scatter (split_bam in split.split_bams) {
call samtools.collate_to_fastq as bam_to_fastq { input:
bam=split_bam,
paired_end=paired_end,
interleaved=false, # matches default but prevents user from overriding
use_all_cores=use_all_cores,
max_retries=max_retries
}
}
}

scatter (reads in
zip(bam_to_fastq.read_one_fastq_gz, bam_to_fastq.read_two_fastq_gz)
if (!paired_end){
scatter (split_bam in split.split_bams) {
call samtools.bam_to_fastq as bam_to_fastq_se { input:
bam=split_bam,
paired_end=paired_end,
interleaved=false, # matches default but prevents user from overriding
output_singletons=true,
use_all_cores=use_all_cores,
max_retries=max_retries
}
}
}

Array[File?] r1 = select_first([bam_to_fastq.read_one_fastq_gz, bam_to_fastq_se.interleaved_reads_fastq_gz])
Array[File] read1s_ = select_all(r1)

Array[File?] read2s_ = select_first([bam_to_fastq.read_two_fastq_gz, bam_to_fastq_se.read_two_fastq_gz])

scatter (reads in
zip(read1s_, read2s_)
Member:
Same question about zip
https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#arraypairxy-ziparrayx-arrayy

Seems to me this would break in SE mode

Member Author:
It was already using a zip. I just changed the parameters. I think the underlying bam-to-fastq logic emits an Array[File?]. The array would have the correct length, it would just be full of empty elements.

) {
call fq.fqlint { input:
read_one_fastq=select_first([reads.left, "undefined"]),
@@ -75,7 +96,7 @@
}

output {
Array[File] read1s = select_all(bam_to_fastq.read_one_fastq_gz)
Array[File?] read2s = bam_to_fastq.read_two_fastq_gz
Array[File] read1s = read1s_
Array[File?] read2s = read2s_
}
}
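As a final illustration of how the optional arrays above resolve, assume a paired-end run with two split BAMs (values are hypothetical; the single-end scatter never executes, so its outputs are undefined):

    # bam_to_fastq.read_one_fastq_gz              -> ["shard0_R1.fastq.gz", "shard1_R1.fastq.gz"]
    # bam_to_fastq_se.interleaved_reads_fastq_gz  -> None  (the if (!paired_end) block did not run)
    Array[File?] r1 = select_first([bam_to_fastq.read_one_fastq_gz, bam_to_fastq_se.interleaved_reads_fastq_gz])
    #   -> ["shard0_R1.fastq.gz", "shard1_R1.fastq.gz"]   (first defined value wins)
    Array[File] read1s_ = select_all(r1)
    #   -> the same array; select_all only drops elements that are None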