Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sz dyst peaker #457

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
88 changes: 56 additions & 32 deletions .dockstore.yml
Original file line number Diff line number Diff line change
@@ -1,17 +1,11 @@
version: 1.2
workflows:

###################################################
# deprecated
- name: ONT10x
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/ONT10x.wdl
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCS
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCS.wdl
Expand All @@ -24,21 +18,24 @@ workflows:
- name: PBCCSDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCCSDemultiplexWholeGenome.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBCLRDemultiplexWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRDemultiplexWholeGenome.wdl
- name: PBCLRWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/PBCLRWholeGenome.wdl
- name: LRCNVs
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl

###################################################
# ONT
- name: ONTWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/VariantCalling/ONTWholeGenome.wdl
- name: ONTFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcell.wdl
- name: ONTBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTBasecall.wdl
Expand All @@ -48,15 +45,6 @@ workflows:
- name: ONTAssembleWithFlye
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Assembly/ONTAssembleWithFlye.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: DownloadFromHudsonAlpha
subclass: wdl
primaryDescriptorPath: /wdl/deprecated/DownloadFromHudsonAlpha.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: ONTMethylation
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Epigenomics/ONTMethylation.wdl
Expand All @@ -66,12 +54,42 @@ workflows:
- name: ONTPfTypeDrugResistanceMarkers
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/MultiAnalysis/ONTPfTypeDrugResistanceMarkers.wdl
- name: ONTProcessBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl

###################################################
# PacBio
- name: PBFlowcell
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Alignment/PBFlowcell.wdl
- name: PBCCSIsoSeq
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBCCSIsoSeq.wdl
- name: PBCCSWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
- name: PBAssembleWithHifiasm
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Assembly/PBAssembleWithHifiasm.wdl
- name: PBMASIsoSeqQuantify
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqQuantify.wdl
- name: PBMASIsoSeqDemultiplex
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/PacBio/Utility/PBMASIsoSeqDemultiplex.wdl

###################################################
# TechAgnostic
- name: LRCNVs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRCNVs.wdl
- name: VerifyFingerprint
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/VerifyFingerprint.wdl
- name: LRJointCallGVCFs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/LRJointCallGVCFs.wdl
Expand All @@ -87,12 +105,18 @@ workflows:
- name: DownloadFromWeb
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DownloadFromWeb.wdl
- name: ONTProcessBasecall
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTProcessBasecall.wdl
- name: ONTFlowcellFromMultipleBasecalls
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/ONT/Preprocessing/ONTFlowcellFromMultipleBasecalls.wdl
- name: CleanupIntermediate
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CleanupIntermediate.wdl
- name: MergeSampleBamsAndCollectMetrics
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/MergeSampleBamsAndCollectMetrics.wdl
- name: ShardWholeGenome
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/ShardWholeGenome.wdl
- name: CallVariantsReadBased
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/VariantCalling/CallVariantsReadBased.wdl
- name: DystPeaker
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/DystPeaker.wdl
73 changes: 73 additions & 0 deletions docker/lr-bam-dedup/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
FROM python:3.9.16-slim-bullseye

COPY remove_duplicate_ont_aln.py /opt/
COPY remove_duplicate_ont_namesorted_unaligned.py /opt/

RUN pip install pysam==0.21.0

# install gcloud and gsutil cli
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
apt-transport-https \
ca-certificates \
gnupg \
zlib1g-dev \
curl \
wget \
tree \
tabix && \
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt-get -qqy update && \
apt-get -qqy install --no-install-recommends google-cloud-cli && \
gcloud config set core/disable_usage_reporting true && \
gcloud config set component_manager/disable_update_check true && \
gcloud config set metrics/environment github_docker_image && \
apt-get -qqy purge gnupg && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_*

# install latest samtools
ARG DEBIAN_FRONTEND=noninteractive
ARG SAMTOOLS_VERSION=1.18
ARG BCFTOOLS_VERSION=1.18
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
ca-certificates \
libbz2-dev \
libcurl4-openssl-dev \
liblzma-dev \
libncurses5-dev \
autoconf \
automake \
bzip2 \
gcc \
make \
wget \
zlib1g-dev && \
wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
tar xjf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
cd samtools-${SAMTOOLS_VERSION} && ./configure --without-curses --enable-libcurl && make -s all all-htslib && make install install-htslib && cd - && \
rm -rf samtools-${SAMTOOLS_VERSION}* && \
wget https://github.com/samtools/bcftools/releases/download/${BCFTOOLS_VERSION}/bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
tar xjf bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \
cd bcftools-${BCFTOOLS_VERSION} && ./configure --without-curses && make -s && make install && cd - && \
rm -rf bcftools-${BCFTOOLS_VERSION}* && \
apt-get -qqy purge autoconf automake bzip2 gcc make wget && \
apt-get -qqy clean && \
rm -rf /tmp/* \
/var/tmp/* \
/var/cache/apt/* \
/var/lib/apt/lists/* \
/usr/share/man/?? \
/usr/share/man/??_* && \
samtools --help && \
bcftools --help
12 changes: 12 additions & 0 deletions docker/lr-bam-dedup/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
VERSION = 0.1.2
TAG1 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:$(VERSION)
TAG2 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:latest

all: build push

build:
docker build -t $(TAG1) -t $(TAG2) .

push:
docker push $(TAG1)
docker push $(TAG2)
76 changes: 76 additions & 0 deletions docker/lr-bam-dedup/remove_duplicate_ont_aln.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
import argparse
import pysam


def main():
parser = argparse.ArgumentParser(description='Remove redundant alignment records from ONT BAM file',
prog='remove_redundant_reads')
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix")
parser.add_argument('-a', '--annotations', type=str, help="Annotations on (potential) duplicate reads")

parser.add_argument('bam', type=str, help="BAM")
args = parser.parse_args()

# create a dict of set's, a trick to avoid Hash collisions
guilty_dict_per_chr = dict()
with open(args.annotations) as f:
for line in f:
arr = line.strip().split('\t')
name = arr[0]
chrom = arr[2]
guilty_dict_per_chr.setdefault(chrom, set())
guilty_dict_per_chr[chrom].add(name)

print("chromosomes on which there are duplicate records:")
print(f" {guilty_dict_per_chr}")

# Silence message about the .bai file not being found.
pysam.set_verbosity(0)

num_alignments, num_dropped_alignments = 0, 0
duplicate_signatures = []
bf = pysam.Samfile(args.bam, 'rb', check_sq=False)
with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out:
# we rely on the observation that for coordinate sorted BAM,
# duplicate records will appear in blocks, hence once we step off a position with duplicates, we start afresh
current_position = -1
current_signatures = set()
for read in bf:
num_alignments += 1

chrom = read.reference_name
n = read.query_name
if chrom in guilty_dict_per_chr and n in guilty_dict_per_chr[chrom]:

mq = read.mapping_quality
sam_flag = read.flag
pos = read.reference_start
cigar = read.cigarstring
signature = f"{n}-{chrom}-{pos}-{mq}-{sam_flag}-{cigar}"

if current_position != pos: # new position, let's write and reset
out.write(read)
current_position = pos
current_signatures = set()
current_signatures.add(signature)
elif signature in current_signatures: # You're a duplicate record, and not appearing for the 1st time!
num_dropped_alignments += 1
duplicate_signatures.append(signature) # same signature may appear more than twice, hence list and append
pass
else: # you are in a new group of duplicates that map to this location
out.write(read)
current_signatures.add(signature)
else:
out.write(read)

print(f'num_alignments: {num_alignments}')
print(f'num_dropped_alignments: {num_dropped_alignments}')
print(f'num_kept_alignments: {num_alignments - num_dropped_alignments}')

with open(f'{args.prefix}.duplicate.signatures.txt', 'w') as out:
for sig in duplicate_signatures:
out.write(f"{sig}\n")


if __name__ == "__main__":
main()
48 changes: 48 additions & 0 deletions docker/lr-bam-dedup/remove_duplicate_ont_namesorted_unaligned.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
import argparse
import pysam


def main():
parser = argparse.ArgumentParser(description='Remove redundant reads from renamed-sorted ONT BAM file',
prog='remove_redundant_reads')
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix")
parser.add_argument('-q', '--qnames', type=str, help="Read names of duplicate records")

parser.add_argument('bam', type=str, help="BAM")
args = parser.parse_args()

# Silence message about the .bai file not being found.
pysam.set_verbosity(0)
bf = pysam.Samfile(args.bam, 'rb', check_sq=False)

num_records, num_dropped_records = 0, 0
duplicate_record_names = list()

with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out:

# we rely on the observation that for queryname sorted, unaligned BAM,
# if two neighboring records have the same query name, then they must be duplicate of each other
current_qm = ''

for read in bf:
num_records += 1

n = read.query_name
if n == current_qm:
duplicate_record_names.append(n)
num_dropped_records += 1
else:
current_qm = n
out.write(read)

print(f'num_records: {num_records}')
print(f'num_dropped_records: {num_dropped_records}')
print(f'num_kept_alignments: {num_records - num_dropped_records}')

with open(args.qnames, 'w') as outf:
for qn in duplicate_record_names:
outf.write(f'{qn}\n')


if __name__ == "__main__":
main()
Loading
Loading