Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mosdepth cov stat #454

Open
wants to merge 63 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
611c6c3
Added CoverageStats.wdl
Aug 17, 2023
b825dc9
added "_coverage" to headers
Aug 17, 2023
fa49a9d
Rounding to the nearest hundreth and centralized header suffix
Aug 18, 2023
dffa95e
removed old rounding line
Aug 18, 2023
be08fcc
More parameters to mosdepthoverbed task
May 2, 2024
9598e34
More parameters to mosdepthoverbed task,
May 9, 2024
24e0a3f
Added subheaders in CoverageStats task command, use bc to get D2_sum,…
May 10, 2024
d33b8cc
use awk to get D2_sum
May 10, 2024
f566382
updated awk command to get D2_sum
May 10, 2024
9e42b15
updated awk command to get D2_sum with $
May 10, 2024
8d47439
using python for evenness calculation because awk bash doesn't suppor…
May 10, 2024
a6df7de
fixed e score formula
May 16, 2024
93ba31a
round e-score to hundredth
May 16, 2024
26e6012
Merge branch 'main' into bs_mosdepth_cov_stat
May 20, 2024
310f66f
removed `mosdepth_by_region` intermediate variable
May 23, 2024
8d6fba6
threads option added
May 23, 2024
5cf9f47
minor fix
May 23, 2024
f3f6dd6
if mean zero, evenness score is none
Jul 23, 2024
fe52543
graceful exit if datamash calculation failed
Jul 23, 2024
d15d9f7
coverage evenness gets 'null' if unable to calculate
Jul 24, 2024
3118aa7
transformed shell script to python and added to mosdepth Dockerfile
Jul 24, 2024
2d1e64e
fixed coverage_stats.py option names
Jul 26, 2024
9ce9549
changed column names
Jul 26, 2024
49445d8
new line eof for output file
Jul 26, 2024
6431ff2
moved changes to bs_coverage_stats_script_mosdepth branch
Jul 26, 2024
ac916c0
"Addition: logic for bed_file with one inteval in MosdepthCoverageSta…
Aug 1, 2024
f46a66b
"Addition: reformated logic for bed_file with one inteval in Mosdepth…
Aug 5, 2024
2608f48
"Addition: set parameters to mosdepth, so readability is improved in …
Aug 5, 2024
a641a06
"Addition: add absolute path to coverage_stats.py"
Aug 6, 2024
b6eb2e7
"Addition: combined mosdepth task and coverage"
Aug 6, 2024
8e3e630
"Addition: fix optional output"
Aug 6, 2024
4a5b11f
"Addition: fix optional output"
Aug 6, 2024
1131d85
"Addition: added a scatter option"
Aug 6, 2024
08a2bb4
"Addition: memory parameter added "
Aug 6, 2024
ec15f62
"Addition: memory parameter added to workflow parameter"
Aug 6, 2024
39b3331
"Addition: fail if scatter interval is over 200"
Aug 6, 2024
087788d
"Addition: check bed length in scatter regardless"
Aug 6, 2024
5e49318
"Addition: mosdepth summarizes regions"
Aug 6, 2024
f798f33
"Addition: mini fix"
Aug 6, 2024
8bfcfa7
"Addition: created separate task for cov per interval"
Aug 7, 2024
d6a5ea1
"Addition: changed for loop for per interval"
Aug 7, 2024
c988ddf
"Addition: fix bed_line to line in for loop for per interval"
Aug 7, 2024
9f7fe0e
"Addition: fix white space to be tab in for loop for per interval"
Aug 7, 2024
23f29c8
"Addition: creating json list for cover per interval"
Aug 7, 2024
39e4c80
"Addition: fix for first line json list for cover per interval"
Aug 7, 2024
a763344
"Addition: fix output to be list of dict for cover per interval"
Aug 7, 2024
a3d06e2
"Addition: fix output to be object value for cover per interval"
Aug 7, 2024
e5daa8a
"Addition: fix output to be object value for cover per interval"
Aug 7, 2024
13b0350
"Addition: fix output to be object value for cover per interval"
Aug 7, 2024
1e8a13f
"Addition: made max bed interval value a parameter for cover per inte…
Aug 7, 2024
b58a0ed
"Addition: added extra disk parameter"
Aug 7, 2024
3ba18d4
"Addition: removed cpu details from runtime to avoid expensive custom…
Aug 7, 2024
8a8db15
"Addition: added interval to per interval json
Aug 15, 2024
2cb6d44
"Addition: fix sed command
Aug 15, 2024
a262247
"Addition: fix sed command with backslash for double quotes
Aug 15, 2024
2cf2d12
"Fix: updated quotation for sed in mosdepthperinterval"
Aug 19, 2024
8d7f197
"Fix: updated mosdepthperinterval to skip interval if mosdepth failed"
Aug 19, 2024
8a13b10
"Fix: updated mosdepthperinterval fail loud if mosdepth fails"
Aug 19, 2024
e0cf7d3
"Fix: updated mosdepthperinterval to disable per base reads "
Aug 22, 2024
35ea1d3
"Fix: updated mosdepthperinterval to use samtools to create new bam i…
Aug 26, 2024
e6c2f8b
"Fix: updated mosdepthperinterval to use samtools to create new bam i…
Aug 26, 2024
9cff055
"Fix: updated mosdepthperinterval to use samtools --region-file"
Aug 26, 2024
f177eff
"Fix: updated mosdepthperinterval to use samtools -M and -L"
Aug 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .dockstore.yml
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,9 @@ workflows:
- name: CleanupIntermediate
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CleanupIntermediate.wdl
- name: CoverageStats
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/CoverageStats.wdl
- name: BenchmarkVCFs
subclass: wdl
primaryDescriptorPath: /wdl/pipelines/TechAgnostic/Utility/BenchmarkVCFs.wdl
Expand Down
376 changes: 376 additions & 0 deletions wdl/pipelines/TechAgnostic/Utility/CoverageStats.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,376 @@
version 1.0

workflow MosdepthCoverageStats {

meta {
description: "Calculate coverage statistics using mosdepth."
}
parameter_meta {
aligned_bam: "Aligned BAM file."
aligned_bai: "Aligned BAM index file."
bed_file: "BED file containing regions of interest. Workflow will fail if bed file is empty"
bin_length: "Length of bins to use for coverage calculation."
preemptible_tries: "Number of times to retry a preempted task."
stats_per_interval: "If true, calculate coverage statistics per interval. Must provide a bed file. Workflow will fail if bed file contains more than 200 intervals to avoid accidently over spending."
max_bed_intervals: "Maximum number of intervals allowed in the bed file for MosdpethPerInterval. Default is 200."
}

input {
File aligned_bam
File aligned_bai
File? bed_file

Int? bin_length

Boolean stats_per_interval = false
Int max_bed_intervals = 200

# Runtime parameters
Int preemptible_tries = 3
Int mosdepth_mem = 8
Int mosdepth_extra_disk = 0
}

if (defined(bed_file)) {
if (length(read_lines(select_first([bed_file]))) == 0) {
call FailWorkflow as EmptyBedFile {
input:
message = "stats_per_interval is set to true, but the provided bed file is empty."
}
}
}

if (stats_per_interval) {
if (!defined(bed_file)) {
call FailWorkflow as NoBedFile {
input:
message = "stats_per_interval is set to true, but no bed file is provided."
}
}
if (length(read_lines(select_first([bed_file]))) > max_bed_intervals) {
call FailWorkflow as TooManyIntervals {
input:
message = "stats_per_interval is set to true, but the provided bed file contains more than 200 intervals."
}
}
call MosDepthPerInterval {
input:
bam = aligned_bam,
bai = aligned_bai,
bed = bed_file,
preemptible_tries = preemptible_tries,
mem = mosdepth_mem,
extra_disk = mosdepth_extra_disk
}
}



if (defined(bed_file) && defined(bin_length)) {
call BinBed {
input:
bed_file = bed_file,
bin_length = bin_length
}
}

# Runs in cases where bed file is provided or bed and bin_length are provided
if (defined(bed_file) || (defined(bin_length) && defined(BinBed.binned_bed))) {
call MosDepthOverBed {
input:
bam = aligned_bam,
bai = aligned_bai,
bed = select_first([BinBed.binned_bed, bed_file]),
bin_length = bin_length,
preemptible_tries = preemptible_tries,
mem = mosdepth_mem,
extra_disk = mosdepth_extra_disk,
summarize_regions = true,
}
}

# Runs in cases where no bed file is provided, will run regardless of bin_length
if (!defined(bed_file) ) {
call MosDepthOverBed as MosDepthNoBed {
input:
bam = aligned_bam,
bai = aligned_bai,
bin_length = bin_length,
preemptible_tries = preemptible_tries,
mem = mosdepth_mem,
extra_disk = mosdepth_extra_disk,
}
}


output {
File cov_stat_summary_file = select_first([MosDepthOverBed.cov_stat_summary_file, MosDepthNoBed.cov_stat_summary_file])
Map[String, Float] cov_stat_summary = select_first([MosDepthOverBed.cov_stat_summary, MosDepthNoBed.cov_stat_summary])
File? stats_per_interval_cov_stat_summary_files = MosDepthPerInterval.cov_stat_summary_all_file
# Array[Map[String, Object]]? stats_per_interval_cov_stat_summaries = MosDepthPerInterval.cov_stat_summary_all
}

}


task MosDepthOverBed {

meta {
description: "Calculate coverage using mosdepth."
}
parameter_meta {
bam: "Aligned BAM file."
bai: "Aligned BAM index file."
bed: "BED file containing regions of interest."
bin_length: "Length of bins to use for coverage calculation. default is 1000. If bed file is provided, this parameter is ignored."
chrom: "Chromosome to calculate coverage for."
preemptible_tries: "Number of times to retry a preempted task."
extra_disk: "Extra disk space to allocate for intermediate files."
}

input {
File bam
File bai
File? bed

String? chrom
Int bin_length = 1000
Boolean summarize_regions = false

# Runtime parameters
Int mem = 8
Int preemptible_tries = 3
Int extra_disk = 0
}
# mosdepth parameters
Int threads = 4
Boolean no_per_base = false
Boolean fast_mode = false
Int mapq = 1

# coverage_stats.py parameters
Int cov_col = 4 # column holding the coverage values
Int round = 2

# Calculate disk size
Int disk_size = (2 * ceil(size(bam, "GB") + size(bai, "GB"))) + 100 + extra_disk
String basename = basename(bam, ".bam")
String prefix = "~{basename}.coverage_over_bed"

String cov_file_to_summarize = if summarize_regions then "~{prefix}.regions.bed.gz" else "~{prefix}.per-base.bed.gz"

command {
set -euxo pipefail

# Create symbolic links for bam and bai in the current working directory
ln -s ~{bam} ./~{basename}.bam
ln -s ~{bai} ./~{basename}.bai

mosdepth \
~{true="-n" false="" no_per_base} \
~{true="-x" false="" fast_mode} \
-t ~{threads} \
~{"-c " + chrom} \
~{"-b " + select_first([bed, bin_length])} \
~{"-Q " + mapq} \
~{prefix} ./~{basename}.bam

# Run coverage_stats.py
python3 /coverage_stats.py \
--cov_col ~{cov_col} \
--round ~{round} \
--output_prefix ~{prefix} \
~{cov_file_to_summarize}
}

output {
# mosdepth output
File global_dist = "~{prefix}.mosdepth.global.dist.txt"
File summary = "~{prefix}.mosdepth.summary.txt"
File? per_base = "~{prefix}.per-base.bed.gz"
File region_dist = "~{prefix}.mosdepth.region.dist.txt"
File regions = "~{prefix}.regions.bed.gz"
File regions_csi = "~{prefix}.regions.bed.gz.csi"
# coverage_stats.py output
File cov_stat_summary_file = "~{prefix}.cov_stat_summary.json"
Map[String, Float] cov_stat_summary = read_json("~{prefix}.cov_stat_summary.json")

}

runtime {
memory: mem + " GiB"
disks: "local-disk " + disk_size + " HDD"
preemptible: preemptible_tries
maxRetries: 1
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:bs-cov-sum-0.3.1"
}
}

task MosDepthPerInterval {

meta {
description: "Calculate coverage using mosdepth for each interval in bed."
}
parameter_meta {
bam: "Aligned BAM file."
bai: "Aligned BAM index file."
bed: "BED file containing regions of interest."
bin_length: "Length of bins to use for coverage calculation. default is 1000. If bed file is provided, this parameter is ignored."
preemptible_tries: "Number of times to retry a preempted task."
extra_disk: "Extra disk space to allocate for intermediate files."
}

input {
File bam
File bai
File? bed

Int bin_length = 1000

# Runtime parameters
Int mem = 8
Int preemptible_tries = 3
Int extra_disk = 0
}
# mosdepth parameters
Int threads = 4
Boolean no_per_base = false
Boolean fast_mode = false
Int mapq = 1

# coverage_stats.py parameters
Int cov_col = 4 # column holding the coverage values
Int round = 2

# Calculate disk size
Int disk_size = (2 * ceil(size(bam, "GB") + size(bai, "GB"))) + 100 + extra_disk
String basename = basename(bam, ".bam")
String prefix = "~{basename}.coverage_over_bed"

String cov_file_to_summarize = "~{prefix}.per-base.bed.gz"

command <<<
set -euxo pipefail

# Create symbolic links for bam and bai in the current working directory
ln -s ~{bam} ./~{basename}.bam
ln -s ~{bai} ./~{basename}.bai

# Create file for coverage stats summary of all intervals
touch ~{prefix}.cov_stat_summary_all.json
echo "[" > ~{prefix}.cov_stat_summary_all.json

# Create a temporary directory for intermediate files
tmp_dir=$(mktemp -d)
trap "rm -rf $tmp_dir" EXIT

mapfile -t lines < "~{bed}"

for line in "${lines[@]}"; do
bed_file="$tmp_dir/bed_line.bed"
echo $line | tr ' ' '\t' > $bed_file

# Use samtools to extract reads overlapping the interval
samtools view -bM -L $bed_file ~{basename}.bam > $tmp_dir/interval.bam
samtools index $tmp_dir/interval.bam

mosdepth \
~{true="-n" false="" no_per_base} \
~{true="-x" false="" fast_mode} \
-t ~{threads} \
~{"-Q " + mapq} \
~{prefix} $tmp_dir/interval.bam

# if mosdepth fails, exit with failure
if [[ $? -ne 0 ]]; then
echo "Mosdepth failed for interval $line"
exit 1
fi

# Run coverage_stats.py
python3 /coverage_stats.py \
--debug \
--cov_col ~{cov_col} \
--round ~{round} \
--output_prefix ~{prefix} \
~{cov_file_to_summarize}

# In the coverage stats summary replace the open bracket with 'interval' name and the line as the value
sed -i "s/{/\{\"interval\": \"$line\", /" ~{prefix}.cov_stat_summary.json


# Append the coverage stats summary of the current interval to the file containing the summary of all intervals
if [[ -s ~{prefix}.cov_stat_summary.json ]]; then
cat ~{prefix}.cov_stat_summary.json >> ~{prefix}.cov_stat_summary_all.json
echo "," >> ~{prefix}.cov_stat_summary_all.json
fi

done
# remove the last comma and close the json array
sed -i '$ s/,$//' ~{prefix}.cov_stat_summary_all.json
echo "]" >> ~{prefix}.cov_stat_summary_all.json


>>>

output {
# coverage_stats.py output
File cov_stat_summary_all_file = "~{prefix}.cov_stat_summary_all.json"
# Array[Map[String, Object]] cov_stat_summary_all = read_json("~{prefix}.cov_stat_summary_all.json")

}

runtime {
memory: mem + " GiB"
disks: "local-disk " + disk_size + " HDD"
preemptible: preemptible_tries
maxRetries: 1
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:bs-cov-sum-0.3.1"
}
}


task BinBed {
input {
File? bed_file
Int? bin_length
}

command {
set -euxo pipefail

bedtools makewindows -b ~{bed_file} -w ~{bin_length} > binned.bed

}

output {
File binned_bed = "binned.bed"
}

runtime {
cpu: 1
memory: "1 GiB"
disks: "local-disk 10 HDD"
docker: "us.gcr.io/broad-dsp-lrma/lr-basic:latest"
}
}

task FailWorkflow {
input{
String message
}
command {
set -e

echo "Failing workflow"
echo ~{message}
exit 1
}
output {
String out_message = message
}
runtime {
docker: "marketplace.gcr.io/google/ubuntu2004"
}
}

Loading