Chipseq updates #111

adthrasher · 2023-09-20T19:27:13Z

Updates to reflect changes to SEAseq and add support for SEAseq's paired-end mode.

a-frantz · 2023-09-25T20:08:58Z

tools/samtools.wdl

-        Boolean paired_end = true
+        Boolean append_read_number = true
        Boolean interleaved = false
        Boolean output_singletons = false
+        Boolean paired_end = true


Just curious, did you reorder these to be alphabetical? The style-guide leaves it up to the author how to order parameters of the same type. My process for ordering things is typically:

follow the order of the underlying tools documentation

what feels most important for a user to see should be towards the top

obviously incredibly subjective here. But with this I'd say paired_end should be first in this block.

I'm fine with this being alphabetical, but I don't think any other inputs are. This will likely be the odd one out if left alphabetical, but we intentionally punted the question of "order once type order is satisfied", so I can't complain about anything you do here.

I did choose alphabetical. That said, in this case, I think it makes sense as paired_end is an input for the task, where the other arguments more closely correspond with inputs to the underlying tool.

Ok. The style-guide leaves it up to individual authors, so if you want to write tasks using alphabetic sort at this level, that's fine with me. But I'm going to continue using my (highly subjective) method of sorting inputs, so there will be some degree of inconsistency. I think that's fine. The fine-grain sorting here is fairly inconsequential.

a-frantz · 2023-09-25T20:10:37Z

tools/samtools.wdl

+
+    parameter_meta {
+        bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated."
+        prefix: "Prefix for the output file. The extension will be added."


Suggested change

prefix: "Prefix for the output file. The extension will be added."

prefix: "Prefix for the output file. The extension `.bam` will be added."

Suggested this before seeing the next line. What I suggested is wrong; we need to a bit more verbose here

a-frantz · 2023-09-25T20:11:49Z

tools/samtools.wdl

+        extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]"
+        add_cigar: "Add template cigar ct tag"
+        add_mate_score: "Add mate score tags. These are used by markdup to select the best reads to keep."
+        disable_proper_pair_check: "Disable FR proper pair check"


IDK what this means. Can we elaborate or add an external_help link?

I'll have to look it up. I also am not sure what it means. I assume that it requires there to be one forward mapped read and one reverse mapped read to make a proper pair, but that isn't clear from the help text from samtools.

a-frantz · 2023-09-25T20:16:41Z

tools/samtools.wdl

+    parameter_meta {
+        bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated."
+        prefix: "Prefix for the output file. The extension will be added."
+        extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]"


IDK how I feel about allowing SAM. IMO we should only be dealing with compressed forms of this data type. SAM just tends to be absurdly large and ultimately impractical.

And by eliminating .sam, this can be changed to a Boolean. Makes life a little easier.

If you disagree and want to support .sam, this line should be updated to use choices: {} syntax.

a-frantz · 2023-09-25T20:21:59Z

workflows/chipseq/chipseq-standard.wdl

@@ -105,15 +107,17 @@ workflow chipseq_standard {
        max_retries=max_retries
    }

-    scatter (pair in zip(bam_to_fastqs.read1s, read_groups)){
+    scatter (tuple in zip(zip(bam_to_fastqs.read1s, bam_to_fastqs.read2s), read_groups)){


Are you positive this works for all combinations of inputs? I thought WDL enforced "every array in a zip must be of equal length". I think this line would break under certain circumstances. Or maybe my understanding of zip is wrong.

I need to run some additional test data through, but for both single-end and paired-end data so far, that argument works.

Can you point me to an input JSON you're using for both PE and SE? I want to walk myself through some concrete examples, because I'm confused how this zip here is working. I'm sure once I can see the "filled in" logic it'll be clear.

Sure. I just moved these from my home directory. I did a sed to update the paths. Let me know if I missed one.

/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_input.json /research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_paired_input.json

a-frantz · 2023-09-25T20:23:55Z

workflows/chipseq/chipseq-standard.wdl

+                    "INVALID_MAPPING_QUALITY",
+                    "MATES_ARE_SAME_END",
+                    "MISMATCH_FLAG_MATE_NEG_STRAND",
+                    "MISMATCH_MATE_ALIGNMENT_START"],


These highlighted 4 feel like things that shouldn't be ignored. Are BAMs really coming out of SEAseq with all these problems? Should that be addressed instead of ignored here?

It's an issue with bowtie. It looks like bowtie is emitting multiple alignments, that have equal alignment scores. Instead of picking one to be primary, it just outputs all of them, without marking any as secondary. Then picard fails to match the read records properly. It should be able to use the mate start position information to find the proper mate, but it doesn't. So you get these errors.

Interesting. Is this a known and logged bug in bowtie? Is the SEAseq team aware this is happening? And is this why you started implementing samtools fixmate? Would that fix these errors? (sorry for the bombardment of questions)

I'd much rather fix these errors over simply ignoring them. If I'm understanding correctly, this update (and possible the current version) are outputting BAMs which don't follow the SPEC. I'm not a fan of passing the buck off to bowtie and saying "not a problem with our code". This BAMs will likely fail downstream. I don't think I'm comfortable just glossing over these errors.

My two cents: we either need an update/rollback on bowtie so this isn't a problem in the first place, or we need a tool like samtools fixmate to address the bowtie problems before we can call these BAMs "harmonized".

I believe the solution is to use bowtie2. I don't think there is any support or development of bowtie at this point. I started implementing samtools fixmate for a different issue, so I didn't try it here.

a-frantz · 2023-09-25T20:27:33Z

workflows/general/bam-to-fastqs.wdl

+    Array[File?] read2s_ = select_first([bam_to_fastq.read_two_fastq_gz, bam_to_fastq_se.read_two_fastq_gz])
+
+     scatter (reads in 
+        zip(read1s_, read2s_)


Same question about zip
https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#arraypairxy-ziparrayx-arrayy

Seems to me this would break in SE mode

It was already using a zip. I just changed the parameters. I think the underlying bam-to-fastq logic emits an Array[File?]. The array would have the correct length, it would just be full of empty elements.

a-frantz · 2023-11-06T22:15:21Z

@adthrasher can you split the samtools.wdl changes off into a stand-alone PR? I think those changes are good and should be merged. They don't have to wait on the ChIP-Seq questions to be resolved.

adthrasher · 2023-11-07T14:50:11Z

@adthrasher can you split the samtools.wdl changes off into a stand-alone PR? I think those changes are good and should be merged. They don't have to wait on the ChIP-Seq questions to be resolved.

Moved to #119

* chore: move from pr #11 * apply review suggestions

* main: feat: Style updates (#122) fix: calc_gene_length algorithm. And better docs for gene length and TPM (#120) Iterative samtools merge (#118) chore: move from pr #111 (#119) feat: experimental dnaseq-standard WF (#116) docs: new stuff for style-guide, best-practices, and template (#117) docs: More docs (#114) fix(qc): finish deleting gc_bias_metrics feat(qc): drop requirement of reference_fasta feat: put strandedness value into HTSeq file name (#90) chore: bump all kraken images to latest (#113)

adthrasher · 2024-09-26T19:14:34Z

workflows/chipseq/chipseq-standard.wdl

            max_retries=max_retries
        }
-        String rg_id_field = sub(sub(pair.right, ".*ID:", "ID:"), "\t.*", "") 
+        String rg_id_field = sub(sub(tuple.right, ".*ID:", "ID:"), "\t.*", "") 
        String rg_id = sub(rg_id_field, "ID:", "")
        call samtools.addreplacerg as single_end { input:


Rename single_end to something else

adthrasher added 6 commits September 20, 2023 13:34

fix: Paired-end ChIP-Seq

d3a1280

fix: missing samtools option

a7f0d6b

docs: add clarifying meta and comments

cbb83af

chore: Bump SEAseq to latest release

d98a1a9

feat: Add samtools fixmate

da5b3c1

chore: reorder parameters

a51216f

adthrasher changed the title ~~[WIP] Chipseq updates~~ Chipseq updates Sep 25, 2023

adthrasher requested a review from a-frantz September 25, 2023 19:13

a-frantz requested changes Sep 25, 2023

View reviewed changes

docs: apply suggestions from code review @afrantz

40ef841

adthrasher mentioned this pull request Nov 7, 2023

chore: move from pr #111 #119

Merged

adthrasher added a commit that referenced this pull request Nov 7, 2023

chore: move from pr #111 (#119)

aa6f8a9

* chore: move from pr #11 * apply review suggestions

adthrasher mentioned this pull request Nov 10, 2023

[WIP] Integrate PEAseq paired-end mode for ChIP-Seq #67

Closed

adthrasher commented Sep 26, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chipseq updates #111

Chipseq updates #111

adthrasher commented Sep 20, 2023

a-frantz Sep 25, 2023

adthrasher Sep 26, 2023

a-frantz Sep 26, 2023

a-frantz Sep 25, 2023

a-frantz Sep 25, 2023

a-frantz Sep 25, 2023

adthrasher Sep 26, 2023

a-frantz Sep 25, 2023

a-frantz Sep 25, 2023

a-frantz Sep 25, 2023

adthrasher Sep 26, 2023

a-frantz Sep 26, 2023

adthrasher Sep 26, 2023

a-frantz Sep 25, 2023

adthrasher Sep 26, 2023

a-frantz Sep 26, 2023

adthrasher Sep 26, 2023

a-frantz Sep 25, 2023

adthrasher Sep 26, 2023

a-frantz commented Nov 6, 2023

adthrasher commented Nov 7, 2023

adthrasher Sep 26, 2024

	prefix: "Prefix for the output file. The extension will be added."
	prefix: "Prefix for the output file. The extension `.bam` will be added."

Chipseq updates #111

Are you sure you want to change the base?

Chipseq updates #111

Conversation

adthrasher commented Sep 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a-frantz commented Nov 6, 2023

adthrasher commented Nov 7, 2023

Choose a reason for hiding this comment