-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
a little cleanup #422
base: main
Are you sure you want to change the base?
a little cleanup #422
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @tedsharpe , thanks for the cleanup!
I have some questions and think some small changes are needed.
Let me know!
BTW, the automated tests are catching these errors
https://github.com/broadinstitute/long-read-pipelines/actions/runs/6225233549/job/16895206594?pr=422#step:7:312
docker/lr-metrics/Dockerfile
Outdated
@@ -16,9 +16,10 @@ ENV PATH=/miniconda/bin/:/miniconda/envs/lr-metrics/bin/:/root/google-cloud-sdk/ | |||
|
|||
# install conda packages | |||
COPY ./environment.yml / | |||
RUN conda install -n base conda-libmamba-solver && conda config --set solver libmamba |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with how mamba helps dependencies, in particular where it should be installed.
But if you look at the environment.yml, it's trying to create an env named lr-metrics
, and here you're installing mamba into the base
env.
Is this usually what people do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been reverted, so it's no longer relevant, but here's some info for posterity.
I tried to build this docker on a desktop with 16Gb of memory. No dice. If I was lucky it just got OOM'd after a few hours. So I ran out to microcenter and bought 32Gb of memory. After leaving it overnight it was still trying to solve the environment. Added libmamba and the docker built promptly (a few minutes--don't remember exactly) using little memory. Don't know about the correctness of the environments (but I'd think you'd want the solver in the base environment).
docker/lr-metrics/Dockerfile
Outdated
RUN conda env create -f /environment.yml && conda clean -a | ||
|
||
# install gatk | ||
# install super-special version of gatk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created a ticket for this to help us track #427
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current version eliminates it.
@@ -86,7 +85,7 @@ task IsLocusDeleted { | |||
boot_disk_gb: 10, | |||
preemptible_tries: 2, | |||
max_retries: 1, | |||
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.1" | |||
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we have 0.3.2
on GCR, but given that the main
branch is using 0.3.1
, we can do one of two things here:
- not updating the version here
- find out what
0.3.2
changed from0.3.1
. I see you included several more packages inlr-mosdepth
, so if you built that, then let's also bump the version in the makefile in that docker foler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makefile version bumped.
preemptible_tries: 2, | ||
max_retries: 1, | ||
docker: "us.gcr.io/broad-dsp-lrma/lr-metrics:0.1.11" | ||
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, mosdepth's docker is being updated. Can we look into what changed in 0.3.2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added a couple of packages so that we didn't have to needlessly break this task in two.
String basename = basename(bam, ".bam") | ||
Int disk_size = 2*ceil(size(bam, "GB")) | ||
|
||
command <<< | ||
set -euxo pipefail | ||
|
||
java -jar /usr/local/bin/gatk.jar ComputeLongReadMetrics -I ~{bam} -O ~{basename}.read_metrics -DF WellformedReadFilter | ||
samtools flagstat ~{bam} > ~{basename}.flag_stats.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great.
Should it have its own task, though? I know it's always a balance (localization vs. separation of concerns).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an astounding amount of overhead in invoking a task. Provisioning the VM, downloading the docker image, checking the call cache database, localizing input files, etc., etc., etc. It's especially true of small tasks. For example, in this workflow the task MakeChrIntervalList (an extreme example, I admit) runs for 2 minutes. Of this, the useful work part (UserAction) takes 1.36 seconds. I'm voting for a rule that says, "You combine all the operations that process the same data into a single task unless it's a very long-running task that might escape preemption or significantly reduce workflow real clock time by being parallelized."
docker/lr-metrics/Dockerfile
Outdated
@@ -30,6 +31,9 @@ RUN git clone https://github.com/broadinstitute/gatk.git -b kvg_pbeap \ | |||
# install picard | |||
RUN wget -O /usr/local/bin/picard.jar https://github.com/broadinstitute/picard/releases/download/2.22.1/picard.jar | |||
|
|||
# install gsutil | |||
RUN curl -sSL https://sdk.cloud.google.com | bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this command, just a convention in this repo.
We usually bump the version in the companion make file when the Dockerfile/environment file are changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted.
wdl/tasks/Utility/MergeVCFs.wdl
Outdated
|
||
import "../../structs/Structs.wdl" | ||
|
||
workflow MergeVCFs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this taken from the GATK-SV pipeline?
And is this integrated into other pipelines in this repo?
I checked out your branch and deleted that file, then went to validate all pipelines. All WDLs validated, so I suspect it is not integrated.
If it is indeed not integrated into other pipelines here, do you mind making a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was something I wrote for the GMKF data import. It doesn't belong in this PR. I'll remove it.
Thanks for the review, Steve. I think we're probably ready for another look. |
No description provided.