Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline for ingesting high-coverage PB/ONT aligned BAMs and format them for re-processing #418

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
e09dc65
UPDATES TO THE Hifiasm pipeline:
SHuang-Broad May 8, 2023
d3afc4b
For both CCS/ONT, update PBSV
SHuang-Broad May 10, 2023
3a2ac5c
For both CCS/ONT, update Sniffles-2
SHuang-Broad May 10, 2023
29aa964
CRITICAL: updating pbindex to that from the SMRTLink 12 release
SHuang-Broad May 12, 2023
0262525
MAJOR REFACTOR: UNIFY CCS/ONT WGS PIPELINE
SHuang-Broad May 11, 2023
7be9309
MISC UPDATES TO SEVERAL UTILS TASKS
SHuang-Broad Nov 26, 2023
1ff0912
New docker that's intended to replace lr-basic:
SHuang-Broad Aug 11, 2023
4224ab4
New docker that offers a new mode remove duplicates from a queryname-…
SHuang-Broad Dec 1, 2023
a13a541
New docker that updates the samtools in GATK to the latest (1.18)
SHuang-Broad Nov 30, 2023
2905c69
significantly boost capabilities of BAMutils.wdl
SHuang-Broad Nov 28, 2023
b16c619
update some legacy code due to new interfaces in updated BAMutils.wdl
SHuang-Broad Dec 1, 2023
c96ea8e
New workflows
SHuang-Broad Dec 1, 2023
39d77d2
Improve various existing codes
SHuang-Broad Dec 11, 2023
cbfed4e
New workflow to save files (avoid multiple calls to Finalize)
SHuang-Broad Dec 10, 2023
8039c88
New workflow to unify QC and metrics collection on aligned BAM
SHuang-Broad Dec 15, 2023
b16155f
Update grouping of pipelines in .dockstore.yml
SHuang-Broad Dec 19, 2023
5bf0b2f
update WDL validation bash script
SHuang-Broad Dec 22, 2023
0d61935
DEPRECATION
SHuang-Broad Dec 1, 2023
989ead1
Fix bug in deduplicating aligned ONT BAM
SHuang-Broad Dec 4, 2023
b8e67ff
DEPRECATION:
SHuang-Broad Jan 2, 2024
e5a79c2
Update utility code:
SHuang-Broad Jan 3, 2024
b5dc978
a few tweaks to to AlignedBamQCandMetrics:
SHuang-Broad Jan 12, 2024
1abcc7a
correct two mistakes: 1. a typo; 2. should append not replace
SHuang-Broad Jan 18, 2024
da5bfb0
make read-length util tasks pre-emptible
SHuang-Broad Jan 22, 2024
7975d6d
make fingerprint reads extraction more efficient/resilient
SHuang-Broad Jan 23, 2024
5620d74
fix a stupid logical typo error (in saving files)
SHuang-Broad Jan 23, 2024
e910bac
option to force run fingerprintcheck on small bams
SHuang-Broad Jan 31, 2024
04b82f8
make pbindex task more frugal
SHuang-Broad Jan 23, 2024
14be4e2
Safer and more efficient way to do targetted pileup conversion
SHuang-Broad Feb 25, 2024
1f8dcae
new task allowing one to send email in a WDL task
SHuang-Broad Mar 9, 2024
1ac2a31
lower resources for seqkit stats and read length gathering from fastq
SHuang-Broad Mar 23, 2024
ce90a22
Chaning utility WDLs:
SHuang-Broad Jan 18, 2024
3d3a508
New workflows
SHuang-Broad Dec 4, 2023
60517b4
make several BAM ops more frugal: [GATK ValidateSamFile, GATK SortSam…
SHuang-Broad Jan 23, 2024
a3da0a6
a typo bug fix
SHuang-Broad May 12, 2024
6775bd3
seeing new format in ONT RG headers
SHuang-Broad May 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
135 changes: 102 additions & 33 deletions .dockstore.yml
Original file line number Diff line number Diff line change
@@ -1,17 +1,11 @@
version: 1.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at this in detail. I'm going to assume it's OK.

workflows:

###################################################
# deprecated
- name: ONT10x
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/ONT10x.wdl
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCS
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCS.wdl
Expand All @@ -24,21 +18,24 @@ workflows:
- name: PBCCSDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCSDemultiplexWholeGenome.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBCLRDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRDemultiplexWholeGenome.wdl
- name: PBCLRWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRWholeGenome.wdl
- name: LRCNVs
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl

###################################################
# ONT
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: ONTBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTBasecall.wdl
Expand All @@ -48,15 +45,6 @@ workflows:
- name: ONTAssembleWithFlye
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Assembly/ONTAssembleWithFlye.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: ONTMethylation
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Epigenomics/ONTMethylation.wdl
Expand All @@ -66,12 +54,51 @@ workflows:
- name: ONTPfTypeDrugResistanceMarkers
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/MultiAnalysis/ONTPfTypeDrugResistanceMarkers.wdl
- name: ONTProcessBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl
- name: RemoveDuplicateFromMergedONTBamAndSplitByReadgroup
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/RemoveDuplicateFromMergedONTBamAndSplitByReadgroup.wdl
- name: DeduplicateAndResetONTAlignedBam
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/DeduplicateAndResetONTAlignedBam.wdl

###################################################
# PacBio
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: PBMASIsoSeqQuantify
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqQuantify.wdl
- name: PBMASIsoSeqDemultiplex
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqDemultiplex.wdl
- name: SplitMergedPacBioBamByReadgroup
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/SplitMergedPacBioBamByReadgroup.wdl

###################################################
# TechAgnostic - *mics data processing
- name: CallVariantsReadBased
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/CallVariantsReadBased.wdl
- name: LRCNVs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
- name: LRJointCallGVCFs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRJointCallGVCFs.wdl
Expand All @@ -81,18 +108,60 @@ workflows:
- name: LRConvertBCF
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/LRConvertBCF.wdl

###################################################
# TechAgnostic - *mics data QC & metrics
- name: ShardWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/ShardWholeGenome.wdl
- name: MergeSampleBamsAndCollectMetrics
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/MergeSampleBamsAndCollectMetrics.wdl
- name: AlignedBamQCandMetrics
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/AlignedBamQCandMetrics.wdl
- name: CollectBamFlagStats
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CollectBamFlagStats.wdl
- name: CountTheBeans
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl
- name: DystPeaker
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DystPeaker.wdl
- name: FASTQstats
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/FASTQstats.wdl
- name: FilterBamByLength
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/FilterBamByLength.wdl
- name: LongReadsContaminationEstimation
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl
- name: SexCheckNaive
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/SexCheckNaive.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: VerifyBamFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl

###################################################
# TechAgnostic - utility
- name: CleanupIntermediate
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CleanupIntermediate.wdl
- name: DownloadFromSRA
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromSRA.wdl
- name: DownloadFromWeb
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromWeb.wdl
- name: ONTProcessBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
- name: SaveFilesToDestination
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl
- name: CleanupIntermediate
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/SaveFilesToDestination.wdl
- name: SplitBamByReadgroup
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CleanupIntermediate.wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/SplitBamByReadgroup.wdl
73 changes: 73 additions & 0 deletions docker/lr-bam-dedup/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
FROM python:3.9.16-slim-bullseye
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dockerfiles are so long, and have so much repeated content.
Wouldn't it be possible to start with your lr-gcloud-samtools image as a first stage?


COPY remove_duplicate_ont_aln.py /opt/
COPY remove_duplicate_ont_namesorted_unaligned.py /opt/

RUN pip install pysam==0.21.0

# install gcloud and gsutil cli
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
apt-transport-https \
ca-certificates \
gnupg \
zlib1g-dev \
curl \
wget \
tree \
tabix && \
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt-get -qqy update && \
apt-get -qqy install --no-install-recommends google-cloud-cli && \
gcloud config set core/disable_usage_reporting true && \
gcloud config set component_manager/disable_update_check true && \
gcloud config set metrics/environment github_docker_image && \
apt-get -qqy purge gnupg && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_*

# install latest samtools
ARG DEBIAN_FRONTEND=noninteractive
ARG SAMTOOLS_VERSION=1.18
ARG BCFTOOLS_VERSION=1.18
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
ca-certificates \
libbz2-dev \
libcurl4-openssl-dev \
liblzma-dev \
libncurses5-dev \
autoconf \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that all this stuff -- automake, gcc, wget, etc. -- is left in the final image.
It would be great to leave all that behind in an earlier stage. (See above comment about using lr-basic as a first stage, and do all the compilation there.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see that you remove most of it below. I still think it would be cleaner to remove it via staging.

automake \
bzip2 \
gcc \
make \
wget \
zlib1g-dev && \
wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
tar xjf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
cd samtools-${SAMTOOLS_VERSION} && ./configure --without-curses --enable-libcurl && make -s all all-htslib && make install install-htslib && cd - && \
rm -rf samtools-${SAMTOOLS_VERSION}* && \
wget https://github.com/samtools/bcftools/releases/download/${BCFTOOLS_VERSION}/bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
tar xjf bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
cd bcftools-${BCFTOOLS_VERSION} && ./configure --without-curses && make -s && make install && cd - && \
rm -rf bcftools-${BCFTOOLS_VERSION}* && \
apt-get -qqy purge autoconf automake bzip2 gcc make wget && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_* && \
samtools --help && \
bcftools --help
12 changes: 12 additions & 0 deletions docker/lr-bam-dedup/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
VERSION = 0.1.2
TAG1 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:$(VERSION)
TAG2 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:latest

all: build push

build:
docker build -t $(TAG1) -t $(TAG2) .

push:
docker push $(TAG1)
docker push $(TAG2)
76 changes: 76 additions & 0 deletions docker/lr-bam-dedup/remove_duplicate_ont_aln.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
import argparse
import pysam


def main():
parser = argparse.ArgumentParser(description='Remove redundant alignment records from ONT BAM file',
prog='remove_redundant_reads')
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix")
parser.add_argument('-a', '--annotations', type=str, help="Annotations on (potential) duplicate reads")

parser.add_argument('bam', type=str, help="BAM")
args = parser.parse_args()

# create a dict of set's, a trick to avoid Hash collisions
guilty_dict_per_chr = dict()
with open(args.annotations) as f:
for line in f:
arr = line.strip().split('\t')
name = arr[0]
chrom = arr[2]
guilty_dict_per_chr.setdefault(chrom, set())
guilty_dict_per_chr[chrom].add(name)

print("chromosomes on which there are duplicate records:")
print(f" {guilty_dict_per_chr}")

# Silence message about the .bai file not being found.
pysam.set_verbosity(0)

num_alignments, num_dropped_alignments = 0, 0
duplicate_signatures = []
bf = pysam.Samfile(args.bam, 'rb', check_sq=False)
with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out:
# we rely on the observation that for coordinate sorted BAM,
# duplicate records will appear in blocks, hence once we step off a position with duplicates, we start afresh
current_position = -1
current_signatures = set()
for read in bf:
num_alignments += 1

chrom = read.reference_name
n = read.query_name
if chrom in guilty_dict_per_chr and n in guilty_dict_per_chr[chrom]:

mq = read.mapping_quality
sam_flag = read.flag
pos = read.reference_start
cigar = read.cigarstring
signature = f"{n}-{chrom}-{pos}-{mq}-{sam_flag}-{cigar}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note that you've added the (very lengthy!) cigar string to the signature. I believe that the signature is already unnecessarily lengthy.
It seems to me that we could filter out secondary and supplementary alignments, and then just drop duplicate queryNames. After all, according to the SAM spec:
"For each read/contig in a SAM file, it is required that one and only one line associated with the
read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read."
You could still reset your dictionaries at each new position, assuming that the duplicates you're trying to eliminate always occur at the same position.
Edit: Maybe you're adding all these fields to the signature so that they appear in the output list of duplicate signatures. If so, so be it. But you could still do the comparison on just the query name.


if current_position != pos: # new position, let's write and reset
out.write(read)
current_position = pos
current_signatures = set()
current_signatures.add(signature)
elif signature in current_signatures: # You're a duplicate record, and not appearing for the 1st time!
num_dropped_alignments += 1
duplicate_signatures.append(signature) # same signature may appear more than twice, hence list and append
pass
else: # you are in a new group of duplicates that map to this location
out.write(read)
current_signatures.add(signature)
else:
out.write(read)

print(f'num_alignments: {num_alignments}')
print(f'num_dropped_alignments: {num_dropped_alignments}')
print(f'num_kept_alignments: {num_alignments - num_dropped_alignments}')

with open(f'{args.prefix}.duplicate.signatures.txt', 'w') as out:
for sig in duplicate_signatures:
out.write(f"{sig}\n")


if __name__ == "__main__":
main()
Loading
Loading