-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chipseq updates #111
base: main
Are you sure you want to change the base?
Chipseq updates #111
Conversation
Boolean paired_end = true | ||
Boolean append_read_number = true | ||
Boolean interleaved = false | ||
Boolean output_singletons = false | ||
Boolean paired_end = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, did you reorder these to be alphabetical? The style-guide leaves it up to the author how to order parameters of the same type. My process for ordering things is typically:
- follow the order of the underlying tools documentation
- what feels most important for a user to see should be towards the top
- obviously incredibly subjective here. But with this I'd say
paired_end
should be first in this block.
- obviously incredibly subjective here. But with this I'd say
I'm fine with this being alphabetical, but I don't think any other inputs are. This will likely be the odd one out if left alphabetical, but we intentionally punted the question of "order once type order is satisfied", so I can't complain about anything you do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did choose alphabetical. That said, in this case, I think it makes sense as paired_end
is an input for the task, where the other arguments more closely correspond with inputs to the underlying tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. The style-guide leaves it up to individual authors, so if you want to write tasks using alphabetic sort at this level, that's fine with me. But I'm going to continue using my (highly subjective) method of sorting inputs, so there will be some degree of inconsistency. I think that's fine. The fine-grain sorting here is fairly inconsequential.
tools/samtools.wdl
Outdated
|
||
parameter_meta { | ||
bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated." | ||
prefix: "Prefix for the output file. The extension will be added." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefix: "Prefix for the output file. The extension will be added." | |
prefix: "Prefix for the output file. The extension `.bam` will be added." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested this before seeing the next line. What I suggested is wrong; we need to a bit more verbose here
tools/samtools.wdl
Outdated
extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]" | ||
add_cigar: "Add template cigar ct tag" | ||
add_mate_score: "Add mate score tags. These are used by markdup to select the best reads to keep." | ||
disable_proper_pair_check: "Disable FR proper pair check" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK what this means. Can we elaborate or add an external_help
link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to look it up. I also am not sure what it means. I assume that it requires there to be one forward mapped read and one reverse mapped read to make a proper pair, but that isn't clear from the help text from samtools
.
tools/samtools.wdl
Outdated
parameter_meta { | ||
bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated." | ||
prefix: "Prefix for the output file. The extension will be added." | ||
extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK how I feel about allowing SAM. IMO we should only be dealing with compressed forms of this data type. SAM just tends to be absurdly large and ultimately impractical.
And by eliminating .sam
, this can be changed to a Boolean. Makes life a little easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you disagree and want to support .sam
, this line should be updated to use choices: {}
syntax.
@@ -105,15 +107,17 @@ workflow chipseq_standard { | |||
max_retries=max_retries | |||
} | |||
|
|||
scatter (pair in zip(bam_to_fastqs.read1s, read_groups)){ | |||
scatter (tuple in zip(zip(bam_to_fastqs.read1s, bam_to_fastqs.read2s), read_groups)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you positive this works for all combinations of inputs? I thought WDL enforced "every array in a zip must be of equal length". I think this line would break under certain circumstances. Or maybe my understanding of zip
is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to run some additional test data through, but for both single-end and paired-end data so far, that argument works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point me to an input JSON you're using for both PE and SE? I want to walk myself through some concrete examples, because I'm confused how this zip
here is working. I'm sure once I can see the "filled in" logic it'll be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I just moved these from my home directory. I did a sed
to update the paths. Let me know if I missed one.
/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_input.json
/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_paired_input.json
"INVALID_MAPPING_QUALITY", | ||
"MATES_ARE_SAME_END", | ||
"MISMATCH_FLAG_MATE_NEG_STRAND", | ||
"MISMATCH_MATE_ALIGNMENT_START"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These highlighted 4 feel like things that shouldn't be ignored. Are BAMs really coming out of SEAseq with all these problems? Should that be addressed instead of ignored here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an issue with bowtie
. It looks like bowtie
is emitting multiple alignments, that have equal alignment scores. Instead of picking one to be primary, it just outputs all of them, without marking any as secondary. Then picard
fails to match the read records properly. It should be able to use the mate start position information to find the proper mate, but it doesn't. So you get these errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Is this a known and logged bug in bowtie
? Is the SEAseq team aware this is happening? And is this why you started implementing samtools fixmate
? Would that fix these errors? (sorry for the bombardment of questions)
I'd much rather fix these errors over simply ignoring them. If I'm understanding correctly, this update (and possible the current version) are outputting BAMs which don't follow the SPEC. I'm not a fan of passing the buck off to bowtie
and saying "not a problem with our code". This BAMs will likely fail downstream. I don't think I'm comfortable just glossing over these errors.
My two cents: we either need an update/rollback on bowtie
so this isn't a problem in the first place, or we need a tool like samtools fixmate
to address the bowtie problems before we can call these BAMs "harmonized".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the solution is to use bowtie2
. I don't think there is any support or development of bowtie
at this point. I started implementing samtools fixmate
for a different issue, so I didn't try it here.
Array[File?] read2s_ = select_first([bam_to_fastq.read_two_fastq_gz, bam_to_fastq_se.read_two_fastq_gz]) | ||
|
||
scatter (reads in | ||
zip(read1s_, read2s_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about zip
https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#arraypairxy-ziparrayx-arrayy
Seems to me this would break in SE mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was already using a zip
. I just changed the parameters. I think the underlying bam-to-fastq logic emits an Array[File?]
. The array would have the correct length, it would just be full of empty elements.
@adthrasher can you split the |
Moved to #119 |
* main: feat: Style updates (#122) fix: calc_gene_length algorithm. And better docs for gene length and TPM (#120) Iterative samtools merge (#118) chore: move from pr #111 (#119) feat: experimental dnaseq-standard WF (#116) docs: new stuff for style-guide, best-practices, and template (#117) docs: More docs (#114) fix(qc): finish deleting gc_bias_metrics feat(qc): drop requirement of reference_fasta feat: put strandedness value into HTSeq file name (#90) chore: bump all kraken images to latest (#113)
max_retries=max_retries | ||
} | ||
String rg_id_field = sub(sub(pair.right, ".*ID:", "ID:"), "\t.*", "") | ||
String rg_id_field = sub(sub(tuple.right, ".*ID:", "ID:"), "\t.*", "") | ||
String rg_id = sub(rg_id_field, "ID:", "") | ||
call samtools.addreplacerg as single_end { input: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename single_end
to something else
Updates to reflect changes to SEAseq and add support for SEAseq's paired-end mode.