-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Follow up to PR #418 #430
Open
SHuang-Broad
wants to merge
35
commits into
main
Choose a base branch
from
sh_ingest_singlerg
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Follow up to PR #418 #430
Changes from all commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
e09dc65
UPDATES TO THE Hifiasm pipeline:
SHuang-Broad d3afc4b
For both CCS/ONT, update PBSV
SHuang-Broad 3a2ac5c
For both CCS/ONT, update Sniffles-2
SHuang-Broad 29aa964
CRITICAL: updating pbindex to that from the SMRTLink 12 release
SHuang-Broad 0262525
MAJOR REFACTOR: UNIFY CCS/ONT WGS PIPELINE
SHuang-Broad 7be9309
MISC UPDATES TO SEVERAL UTILS TASKS
SHuang-Broad 1ff0912
New docker that's intended to replace lr-basic:
SHuang-Broad 4224ab4
New docker that offers a new mode remove duplicates from a queryname-…
SHuang-Broad a13a541
New docker that updates the samtools in GATK to the latest (1.18)
SHuang-Broad 2905c69
significantly boost capabilities of BAMutils.wdl
SHuang-Broad b16c619
update some legacy code due to new interfaces in updated BAMutils.wdl
SHuang-Broad c96ea8e
New workflows
SHuang-Broad 39d77d2
Improve various existing codes
SHuang-Broad cbfed4e
New workflow to save files (avoid multiple calls to Finalize)
SHuang-Broad 8039c88
New workflow to unify QC and metrics collection on aligned BAM
SHuang-Broad b16155f
Update grouping of pipelines in .dockstore.yml
SHuang-Broad 5bf0b2f
update WDL validation bash script
SHuang-Broad 0d61935
DEPRECATION
SHuang-Broad 989ead1
Fix bug in deduplicating aligned ONT BAM
SHuang-Broad b8e67ff
DEPRECATION:
SHuang-Broad e5a79c2
Update utility code:
SHuang-Broad b5dc978
a few tweaks to to AlignedBamQCandMetrics:
SHuang-Broad 1abcc7a
correct two mistakes: 1. a typo; 2. should append not replace
SHuang-Broad da5bfb0
make read-length util tasks pre-emptible
SHuang-Broad 7975d6d
make fingerprint reads extraction more efficient/resilient
SHuang-Broad 5620d74
fix a stupid logical typo error (in saving files)
SHuang-Broad e910bac
option to force run fingerprintcheck on small bams
SHuang-Broad 04b82f8
make pbindex task more frugal
SHuang-Broad 14be4e2
Safer and more efficient way to do targetted pileup conversion
SHuang-Broad 136091a
Update Pacbio alignment task
SHuang-Broad d95be34
New subworkflow to align PacBio uBAM
SHuang-Broad 1e26de7
Two new workflows:
SHuang-Broad 44b7330
bug fix for bam2fastq
SHuang-Broad 526765e
optimize alignment further
SHuang-Broad 3dd95fa
test reducing resources for pbmm2 further
SHuang-Broad File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
FROM python:3.9.16-slim-bullseye | ||
|
||
COPY remove_duplicate_ont_aln.py /opt/ | ||
COPY remove_duplicate_ont_namesorted_unaligned.py /opt/ | ||
|
||
RUN pip install pysam==0.21.0 | ||
|
||
# install gcloud and gsutil cli | ||
ARG DEBIAN_FRONTEND=noninteractive | ||
RUN apt-get -qqy update --fix-missing && \ | ||
apt-get -qqy dist-upgrade && \ | ||
apt-get -qqy install --no-install-recommends \ | ||
apt-transport-https \ | ||
ca-certificates \ | ||
gnupg \ | ||
zlib1g-dev \ | ||
curl \ | ||
wget \ | ||
tree \ | ||
tabix && \ | ||
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \ | ||
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \ | ||
apt-get -qqy update && \ | ||
apt-get -qqy install --no-install-recommends google-cloud-cli && \ | ||
gcloud config set core/disable_usage_reporting true && \ | ||
gcloud config set component_manager/disable_update_check true && \ | ||
gcloud config set metrics/environment github_docker_image && \ | ||
apt-get -qqy purge gnupg && \ | ||
apt-get -qqy clean && \ | ||
rm -rf /tmp/* \ | ||
/var/tmp/* \ | ||
/var/cache/apt/* \ | ||
/var/lib/apt/lists/* \ | ||
/usr/share/man/?? \ | ||
/usr/share/man/??_* | ||
|
||
# install latest samtools | ||
ARG DEBIAN_FRONTEND=noninteractive | ||
ARG SAMTOOLS_VERSION=1.18 | ||
ARG BCFTOOLS_VERSION=1.18 | ||
RUN apt-get -qqy update --fix-missing && \ | ||
apt-get -qqy dist-upgrade && \ | ||
apt-get -qqy install --no-install-recommends \ | ||
ca-certificates \ | ||
libbz2-dev \ | ||
libcurl4-openssl-dev \ | ||
liblzma-dev \ | ||
libncurses5-dev \ | ||
autoconf \ | ||
automake \ | ||
bzip2 \ | ||
gcc \ | ||
make \ | ||
wget \ | ||
zlib1g-dev && \ | ||
wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \ | ||
tar xjf samtools-${SAMTOOLS_VERSION}.tar.bz2 && \ | ||
cd samtools-${SAMTOOLS_VERSION} && ./configure --without-curses --enable-libcurl && make -s all all-htslib && make install install-htslib && cd - && \ | ||
rm -rf samtools-${SAMTOOLS_VERSION}* && \ | ||
wget https://github.com/samtools/bcftools/releases/download/${BCFTOOLS_VERSION}/bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \ | ||
tar xjf bcftools-${BCFTOOLS_VERSION}.tar.bz2 && \ | ||
cd bcftools-${BCFTOOLS_VERSION} && ./configure --without-curses && make -s && make install && cd - && \ | ||
rm -rf bcftools-${BCFTOOLS_VERSION}* && \ | ||
apt-get -qqy purge autoconf automake bzip2 gcc make wget && \ | ||
apt-get -qqy clean && \ | ||
rm -rf /tmp/* \ | ||
/var/tmp/* \ | ||
/var/cache/apt/* \ | ||
/var/lib/apt/lists/* \ | ||
/usr/share/man/?? \ | ||
/usr/share/man/??_* && \ | ||
samtools --help && \ | ||
bcftools --help |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
VERSION = 0.1.2 | ||
TAG1 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:$(VERSION) | ||
TAG2 = us.gcr.io/broad-dsp-lrma/lr-bam-dedup:latest | ||
|
||
all: build push | ||
|
||
build: | ||
docker build -t $(TAG1) -t $(TAG2) . | ||
|
||
push: | ||
docker push $(TAG1) | ||
docker push $(TAG2) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
import argparse | ||
import pysam | ||
|
||
|
||
def main(): | ||
parser = argparse.ArgumentParser(description='Remove redundant alignment records from ONT BAM file', | ||
prog='remove_redundant_reads') | ||
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix") | ||
parser.add_argument('-a', '--annotations', type=str, help="Annotations on (potential) duplicate reads") | ||
|
||
parser.add_argument('bam', type=str, help="BAM") | ||
args = parser.parse_args() | ||
|
||
# create a dict of set's, a trick to avoid Hash collisions | ||
guilty_dict_per_chr = dict() | ||
with open(args.annotations) as f: | ||
for line in f: | ||
arr = line.strip().split('\t') | ||
name = arr[0] | ||
chrom = arr[2] | ||
guilty_dict_per_chr.setdefault(chrom, set()) | ||
guilty_dict_per_chr[chrom].add(name) | ||
|
||
print("chromosomes on which there are duplicate records:") | ||
print(f" {guilty_dict_per_chr}") | ||
|
||
# Silence message about the .bai file not being found. | ||
pysam.set_verbosity(0) | ||
|
||
num_alignments, num_dropped_alignments = 0, 0 | ||
duplicate_signatures = [] | ||
bf = pysam.Samfile(args.bam, 'rb', check_sq=False) | ||
with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out: | ||
# we rely on the observation that for coordinate sorted BAM, | ||
# duplicate records will appear in blocks, hence once we step off a position with duplicates, we start afresh | ||
current_position = -1 | ||
current_signatures = set() | ||
for read in bf: | ||
num_alignments += 1 | ||
|
||
chrom = read.reference_name | ||
n = read.query_name | ||
if chrom in guilty_dict_per_chr and n in guilty_dict_per_chr[chrom]: | ||
|
||
mq = read.mapping_quality | ||
sam_flag = read.flag | ||
pos = read.reference_start | ||
cigar = read.cigarstring | ||
signature = f"{n}-{chrom}-{pos}-{mq}-{sam_flag}-{cigar}" | ||
|
||
if current_position != pos: # new position, let's write and reset | ||
out.write(read) | ||
current_position = pos | ||
current_signatures = set() | ||
current_signatures.add(signature) | ||
elif signature in current_signatures: # You're a duplicate record, and not appearing for the 1st time! | ||
num_dropped_alignments += 1 | ||
duplicate_signatures.append(signature) # same signature may appear more than twice, hence list and append | ||
pass | ||
else: # you are in a new group of duplicates that map to this location | ||
out.write(read) | ||
current_signatures.add(signature) | ||
else: | ||
out.write(read) | ||
|
||
print(f'num_alignments: {num_alignments}') | ||
print(f'num_dropped_alignments: {num_dropped_alignments}') | ||
print(f'num_kept_alignments: {num_alignments - num_dropped_alignments}') | ||
|
||
with open(f'{args.prefix}.duplicate.signatures.txt', 'w') as out: | ||
for sig in duplicate_signatures: | ||
out.write(f"{sig}\n") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
48 changes: 48 additions & 0 deletions
48
docker/lr-bam-dedup/remove_duplicate_ont_namesorted_unaligned.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
import argparse | ||
import pysam | ||
|
||
|
||
def main(): | ||
parser = argparse.ArgumentParser(description='Remove redundant reads from renamed-sorted ONT BAM file', | ||
prog='remove_redundant_reads') | ||
parser.add_argument('-p', '--prefix', type=str, default="shard", help="Output prefix") | ||
parser.add_argument('-q', '--qnames', type=str, help="Read names of duplicate records") | ||
|
||
parser.add_argument('bam', type=str, help="BAM") | ||
args = parser.parse_args() | ||
|
||
# Silence message about the .bai file not being found. | ||
pysam.set_verbosity(0) | ||
bf = pysam.Samfile(args.bam, 'rb', check_sq=False) | ||
|
||
num_records, num_dropped_records = 0, 0 | ||
duplicate_record_names = list() | ||
|
||
with pysam.Samfile(f'{args.prefix}.bam', 'wb', header=bf.header) as out: | ||
|
||
# we rely on the observation that for queryname sorted, unaligned BAM, | ||
# if two neighboring records have the same query name, then they must be duplicate of each other | ||
current_qm = '' | ||
|
||
for read in bf: | ||
num_records += 1 | ||
|
||
n = read.query_name | ||
if n == current_qm: | ||
duplicate_record_names.append(n) | ||
num_dropped_records += 1 | ||
else: | ||
current_qm = n | ||
out.write(read) | ||
|
||
print(f'num_records: {num_records}') | ||
print(f'num_dropped_records: {num_dropped_records}') | ||
print(f'num_kept_alignments: {num_records - num_dropped_records}') | ||
|
||
with open(args.qnames, 'w') as outf: | ||
for qn in duplicate_record_names: | ||
outf.write(f'{qn}\n') | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed this. Going to assume it's OK.