Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chipseq updates #111

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Chipseq updates #111

wants to merge 7 commits into from

Conversation

adthrasher
Copy link
Member

Updates to reflect changes to SEAseq and add support for SEAseq's paired-end mode.

@adthrasher adthrasher changed the title [WIP] Chipseq updates Chipseq updates Sep 25, 2023
@adthrasher adthrasher requested a review from a-frantz September 25, 2023 19:13
Comment on lines -465 to +469
Boolean paired_end = true
Boolean append_read_number = true
Boolean interleaved = false
Boolean output_singletons = false
Boolean paired_end = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, did you reorder these to be alphabetical? The style-guide leaves it up to the author how to order parameters of the same type. My process for ordering things is typically:

  1. follow the order of the underlying tools documentation
  2. what feels most important for a user to see should be towards the top
    • obviously incredibly subjective here. But with this I'd say paired_end should be first in this block.

I'm fine with this being alphabetical, but I don't think any other inputs are. This will likely be the odd one out if left alphabetical, but we intentionally punted the question of "order once type order is satisfied", so I can't complain about anything you do here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did choose alphabetical. That said, in this case, I think it makes sense as paired_end is an input for the task, where the other arguments more closely correspond with inputs to the underlying tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The style-guide leaves it up to individual authors, so if you want to write tasks using alphabetic sort at this level, that's fine with me. But I'm going to continue using my (highly subjective) method of sorting inputs, so there will be some degree of inconsistency. I think that's fine. The fine-grain sorting here is fairly inconsequential.


parameter_meta {
bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated."
prefix: "Prefix for the output file. The extension will be added."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prefix: "Prefix for the output file. The extension will be added."
prefix: "Prefix for the output file. The extension `.bam` will be added."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested this before seeing the next line. What I suggested is wrong; we need to a bit more verbose here

extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]"
add_cigar: "Add template cigar ct tag"
add_mate_score: "Add mate score tags. These are used by markdup to select the best reads to keep."
disable_proper_pair_check: "Disable FR proper pair check"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK what this means. Can we elaborate or add an external_help link?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to look it up. I also am not sure what it means. I assume that it requires there to be one forward mapped read and one reverse mapped read to make a proper pair, but that isn't clear from the help text from samtools.

parameter_meta {
bam: "Input BAM format file to add mate information. Must be name-sorted or name-collated."
prefix: "Prefix for the output file. The extension will be added."
extension: "File format extension to use for output file. Allowable types: [.sam, .bam, .cram]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK how I feel about allowing SAM. IMO we should only be dealing with compressed forms of this data type. SAM just tends to be absurdly large and ultimately impractical.

And by eliminating .sam, this can be changed to a Boolean. Makes life a little easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you disagree and want to support .sam, this line should be updated to use choices: {} syntax.

@@ -105,15 +107,17 @@ workflow chipseq_standard {
max_retries=max_retries
}

scatter (pair in zip(bam_to_fastqs.read1s, read_groups)){
scatter (tuple in zip(zip(bam_to_fastqs.read1s, bam_to_fastqs.read2s), read_groups)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you positive this works for all combinations of inputs? I thought WDL enforced "every array in a zip must be of equal length". I think this line would break under certain circumstances. Or maybe my understanding of zip is wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to run some additional test data through, but for both single-end and paired-end data so far, that argument works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to an input JSON you're using for both PE and SE? I want to walk myself through some concrete examples, because I'm confused how this zip here is working. I'm sure once I can see the "filled in" logic it'll be clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I just moved these from my home directory. I did a sed to update the paths. Let me know if I missed one.

/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_input.json
/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/athrashe/chipseq_workflow/chipseq_paired_input.json

Comment on lines +170 to +173
"INVALID_MAPPING_QUALITY",
"MATES_ARE_SAME_END",
"MISMATCH_FLAG_MATE_NEG_STRAND",
"MISMATCH_MATE_ALIGNMENT_START"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These highlighted 4 feel like things that shouldn't be ignored. Are BAMs really coming out of SEAseq with all these problems? Should that be addressed instead of ignored here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an issue with bowtie. It looks like bowtie is emitting multiple alignments, that have equal alignment scores. Instead of picking one to be primary, it just outputs all of them, without marking any as secondary. Then picard fails to match the read records properly. It should be able to use the mate start position information to find the proper mate, but it doesn't. So you get these errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Is this a known and logged bug in bowtie? Is the SEAseq team aware this is happening? And is this why you started implementing samtools fixmate? Would that fix these errors? (sorry for the bombardment of questions)

I'd much rather fix these errors over simply ignoring them. If I'm understanding correctly, this update (and possible the current version) are outputting BAMs which don't follow the SPEC. I'm not a fan of passing the buck off to bowtie and saying "not a problem with our code". This BAMs will likely fail downstream. I don't think I'm comfortable just glossing over these errors.

My two cents: we either need an update/rollback on bowtie so this isn't a problem in the first place, or we need a tool like samtools fixmate to address the bowtie problems before we can call these BAMs "harmonized".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the solution is to use bowtie2. I don't think there is any support or development of bowtie at this point. I started implementing samtools fixmate for a different issue, so I didn't try it here.

Array[File?] read2s_ = select_first([bam_to_fastq.read_two_fastq_gz, bam_to_fastq_se.read_two_fastq_gz])

scatter (reads in
zip(read1s_, read2s_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about zip
https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#arraypairxy-ziparrayx-arrayy

Seems to me this would break in SE mode

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was already using a zip. I just changed the parameters. I think the underlying bam-to-fastq logic emits an Array[File?]. The array would have the correct length, it would just be full of empty elements.

@a-frantz
Copy link
Member

a-frantz commented Nov 6, 2023

@adthrasher can you split the samtools.wdl changes off into a stand-alone PR? I think those changes are good and should be merged. They don't have to wait on the ChIP-Seq questions to be resolved.

@adthrasher
Copy link
Member Author

@adthrasher can you split the samtools.wdl changes off into a stand-alone PR? I think those changes are good and should be merged. They don't have to wait on the ChIP-Seq questions to be resolved.

Moved to #119

adthrasher added a commit that referenced this pull request Nov 7, 2023
* chore: move from pr #11

* apply review suggestions
a-frantz added a commit that referenced this pull request Nov 22, 2023
* main:
  feat: Style updates (#122)
  fix: calc_gene_length algorithm. And better docs for gene length and TPM (#120)
  Iterative samtools merge (#118)
  chore: move from pr #111 (#119)
  feat: experimental dnaseq-standard WF (#116)
  docs: new stuff for style-guide, best-practices, and template (#117)
  docs: More docs (#114)
  fix(qc): finish deleting gc_bias_metrics
  feat(qc): drop requirement of reference_fasta
  feat: put strandedness value into HTSeq file name (#90)
  chore: bump all kraken images to latest (#113)
max_retries=max_retries
}
String rg_id_field = sub(sub(pair.right, ".*ID:", "ID:"), "\t.*", "")
String rg_id_field = sub(sub(tuple.right, ".*ID:", "ID:"), "\t.*", "")
String rg_id = sub(rg_id_field, "ID:", "")
call samtools.addreplacerg as single_end { input:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename single_end to something else

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants