-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pipeline for ingesting high-coverage PB/ONT aligned BAMs and format them for re-processing #418
base: main
Are you sure you want to change the base?
Conversation
docker/lr-custom-gatk/Dockerfile
Outdated
/usr/share/man/??_* && \ | ||
samtools --version | ||
|
||
# echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: take this away in final push
docker/lr-gcloud-samtools/Makefile
Outdated
@@ -0,0 +1,12 @@ | |||
VERSION = 0.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down the road, the intention is to replace lr-basic
with this image.
You'll see here that the only meaningful difference with lr-basic
is it installs google-cloud-cli
not google-cloud-sdk
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we eliminate lr-basic now?
def61fc
to
185abdb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of suggestions, but they're mostly pretty trivial. Adopt the ones that seem in good taste to you.
@@ -1,17 +1,11 @@ | |||
version: 1.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at this in detail. I'm going to assume it's OK.
@@ -0,0 +1,73 @@ | |||
FROM python:3.9.16-slim-bullseye |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These dockerfiles are so long, and have so much repeated content.
Wouldn't it be possible to start with your lr-gcloud-samtools image as a first stage?
libcurl4-openssl-dev \ | ||
liblzma-dev \ | ||
libncurses5-dev \ | ||
autoconf \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that all this stuff -- automake, gcc, wget, etc. -- is left in the final image.
It would be great to leave all that behind in an earlier stage. (See above comment about using lr-basic as a first stage, and do all the compilation there.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now see that you remove most of it below. I still think it would be cleaner to remove it via staging.
sam_flag = read.flag | ||
pos = read.reference_start | ||
cigar = read.cigarstring | ||
signature = f"{n}-{chrom}-{pos}-{mq}-{sam_flag}-{cigar}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I note that you've added the (very lengthy!) cigar string to the signature. I believe that the signature is already unnecessarily lengthy.
It seems to me that we could filter out secondary and supplementary alignments, and then just drop duplicate queryNames. After all, according to the SAM spec:
"For each read/contig in a SAM file, it is required that one and only one line associated with the
read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read."
You could still reset your dictionaries at each new position, assuming that the duplicates you're trying to eliminate always occur at the same position.
Edit: Maybe you're adding all these fields to the signature so that they appear in the output list of duplicate signatures. If so, so be it. But you could still do the comparison on just the query name.
docker/lr-basic/Dockerfile
Outdated
@@ -64,6 +64,7 @@ RUN apt-get -qqy update --fix-missing && \ | |||
gnupg \ | |||
less \ | |||
tabix \ | |||
tree \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add bsdmainutils, too. There are a number of WDLs that use the "column" command.
num_rgs=$(wc -l one_rg_per_line.txt | awk '{pritn $1}') | ||
if [[ ${num_rgs} -gt 1 ]]; then exit 1; fi | ||
|
||
cat one_rg_per_line.txt | tr '\t' '\n' > rh_header.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useless cat. tr <one_rg_per_line.txt > rh_header.txt
wdl/tasks/Utility/BAMutils.wdl
Outdated
set -euxo pipefail | ||
|
||
# assumption | ||
sort_order=$(samtools view -H ~{qns_bam} | grep "^@HD" | tr '\t' '\n' | grep "^SO:" | awk -F ':' '{print $2}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider sed rather than tr | grep | awk.
wdl/tasks/Utility/BAMutils.wdl
Outdated
# remote grab read names | ||
echo "false" > samtools.failed.txt | ||
samtools view ~{qns_bam} \ | ||
| awk -F '\t' '{print $1}' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would've expected cut -f1 to be faster than awk, but it isn't.
shows that you always have to test, i guess.
wdl/tasks/Utility/BAMutils.wdl
Outdated
|
||
set +e | ||
count=1 | ||
while true; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really surprised by two things:
- That this is necessary. The token is good for an hour. We're just processing a shard of a BAM.
- That it works. It appears to me that the token is only used once, at the time that the stream is opened.
Have you checked to see that re-auth is necessary and effective?
If so, I'd like to suggest that you put the re-auth while loop in the background rather than the main samtools command. You can then just abandon the background process when the main command exits: no testing counts or checking for exited processes.
|
||
Int disk_size = 3 * ceil(size(qnorder_bam, "GB")) | ||
|
||
command <<< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if you could vastly simplify this entire task with a simple awk program:
GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` && export GCS_OAUTH_TOKEN
touch duplicated.readnames.txt
samtools view -h ~{qnorder_bam} | awk '
BEGIN{FS=OFS="\t"}
{if(substr($0,1,1)=="@")print;
else if($1==lastQName)print lastQName > "duplicated.readnames.txt";
else {print; lastQName = $1}
}' | samtools view -bh - > "~{base}.dedup.bam"
wdl/tasks/Utility/PBUtils.wdl
Outdated
cat readgroup.movies.txt | ||
|
||
# get primrose PG lines | ||
grep "^@PG" header.txt | grep -v "^@SQ" | grep "^@PG" | grep -F 'ID:primrose' | tr '\t' '\n' | grep '^CL:' > primrose.pg.lines.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to note this one: the 2nd and 3rd greps look superfluous to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments
input { | ||
File bam | ||
} | ||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have the output block after the command but before runtime block
String gcs_out_root_dir | ||
} | ||
|
||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the output block right after the input block helps quickly show what the workflow takes and what it gives. Just wanted to point out that miniwdl linting is throwing a warning because the tasks are being used before being called (WORKHORSE) in the workflow.
wdl/tasks/Utility/PBUtils.wdl
Outdated
desciption: "Verify that a PacBio's BAM has primrose run on all its read groups" | ||
} | ||
input { | ||
String bam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Optional) You could set the bam variable as a File and avoid localization by using the localization_optional
String platform_unit = GetReadGroupInfo.read_group_info['PU'] | ||
|
||
if (debug_mode) { | ||
call Utils.CountBamRecords { input: bam = bam } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment made by Ted but for CountBamRecords calls
call FF.FinalizeToFile as SaveFq { | ||
input: file = BamToFastq.reads_fq, outdir = outdir | ||
} | ||
if (debug_mode) { call Utils.CountFastqRecords { input: fastq = BamToFastq.reads_fq } } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for CountFastqRecords, not being referenced
description: "Check several metadata of an input BAM (aliged? sort order? etc)" | ||
} | ||
input { | ||
File bam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The localization_optional: true
setting in the parameter_meta block of this task indicates to Cromwell that the file does not need to be downloaded
RuntimeAttr? runtime_attr_override | ||
} | ||
|
||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output block after command block
RuntimeAttr? runtime_attr_override | ||
} | ||
|
||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output after command block
wdl/tasks/Utility/BAMutils.wdl
Outdated
Int? num_ssds | ||
RuntimeAttr? runtime_attr_override | ||
} | ||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output after command block
wdl/tasks/Utility/BAMutils.wdl
Outdated
RuntimeAttr? runtime_attr_override | ||
} | ||
|
||
output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output after command block
12e954b
to
185abdb
Compare
* update Hifiasm to version 0.19.5 * update how Hifiasm outputs are compressed (bgz replacing gz), also * monitor hifiasm resources usage
* update docker used in PBSV tasks to the version coming with official SMRTLink releases (2.9.0) * change how the 2-step PBSV process is done (following the recommended way now)
* to version 2.0.7 * using TRF bed * conditionally phase sv (requires phased bam) * generates its own vcf.gz and tbi
Overhaul how small variants are called in the WG pipelines * default to use DV to call small variants, Clair3 analysis needs to be requested explicitly * retire the Pepper toolchain completely from the CCS pipeline, using DV directly * for R10.4+ ONT data, also use DV directly * older ONT data would still use the PEPPER-DV-Margin pipeline * offers GPU version (though based on, it's not worth it yet) * update how bam haplotagging is done Cleanup structural variants calling * experiment with SNF2 phasing SV calls (implicitly depends on small variants calling now) * tune PBSV calling - discover now supports --hifi - output vcf.gz and tbi - less verbose logging by default Misc.: * optimizations to BAM merging and metrics workflow * updates coverage collection step * new R script to visualize log from vm_monitoring_script.sh
* organize dockstore.yml file a bit * make WDL validation shell script more usable * update pbmm2 and pbindex to versions in SMRTLink * update GeneralUtils.wdl - two bash-like new tasks [CoerceMapToArrayOfPairs, CoerceArrayOfPairsToMap] - cleanup task CollapseArrayOfStrings * update resource allocations to tasks - NanoplotFromBam (also changes docker) - MosDepthWGS
* incorporates gcloud cli (not just gsutil) * integrate libdeflate for more speedups
incorporate new tasks and optimize them * [CountMethylCallReads, GatherReadsWithoutMethylCalls] from sh_beans * [GetPileup, BamToRelevantPileup] from sh_more_atomic_qc * [GetReadGroupLines, GetSortOrder, SplitNameSortedUbam] from sh_ont_fc * [SamtoolsFlagStats, ParseFlagStatsJson] from sh_trvial_stats * [FilterBamByLen, InferSampleName] from sh_seqkit * [CountAlignmentRecords, StreamingBamErrored, CountAlignmentRecordsByFlag] from sh_maha_aln_metrics * [ResetSamplename] from sh_ingest_singlerg * [MergeBamsWithSamtools] from sh_ont_fc.Utils.wdl * [BamToFastq] from sh_more_bam_qcs and optimize it with sh_ingest_singlerg.Utils.wdl delete * GetSortOrder as that's now implemented in GatherBamMetadata * Drop2304Alignments as that's no longer used update dockers to the latest
d3c809c
to
80e19c0
Compare
CHERRY-PICK FROM VARIOUS QC/METRICS BRANCHES: * collect information about ML/MM tags in a long-read BAM (sh_beans) * a heuristic way to find peaks in a distribution (using dyst) (sh_dyst_peaker) * filter reads by length in a BAM * collect some read quality stats from (length-filtered) FASTQ/BAM (sh_seq_kit) * VerifyBamID2 (for contamination estimation) * naive sex-concordance check (sh_more_atomic_qc) * check fingerprint of a single BAM file (sh_sample_fp) * collect SAM flag stats (sh_trivial_stats)
* make BeanCounter finalization optional (wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl) * custom struct for sub-workflow config using a JSON (wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl) * make fingerprint checking subworkflow control size filtering (wdl/tasks/QC/FPCheckAoU.wdl) (wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl) * fix a warning by IDE/miniwdl complaining WDL stdlib function length only applies to Array (wdl/tasks/Utility/BAMutils.wdl) * various updates to Finalize (wdl/tasks/Utility/Finalize.wdl) New tasks in (wdl/tasks/Utility/GeneralUtils.wdl) to * correctly convert Map to TSV * concatenate files
a105f00
to
f0367e9
Compare
* SampleLevelAlignedMetrics.wdl * PBCLRWholeGenome.wdl
* new struct in AlignedBamQCandMetrics.wdl to facilicate as-sub-workflow calling * change parameters name for fingerprint workflows
* make saving of reads without methylation SAM tags optional * better parameter naming
c7a8e45
to
e397755
Compare
28a2bde
to
bbbaa8c
Compare
bbbaa8c
to
e1322e8
Compare
9337e79
to
ffd7624
Compare
(affects contamination estimation)
ffd7624
to
2657c15
Compare
* make BAM split by readgroup more efficient * make the ONT duplicated reads verification step resilient to http transient errors * new task to verify primrose was run on PacBio BAM * new task to get basecall model from ONT BAM
eb568d3
to
aa0e08d
Compare
* (for both PacBio, ONT) to split a BAM by read group, and reset the alignment * deduplicate and reset ONT aligned bam * optionally check primrose runs on PacBio input BAMs
aa0e08d
to
86c6a57
Compare
…, samtools fastq, samtools reset]
86c6a57
to
60517b4
Compare
Project-specific.