Estimate RAM usage based on input filesize #44

Open
matthdsm opened this issue Mar 10, 2021 · 14 comments

@matthdsm
Contributor

Hi,

I'm trying to find a way to get a rough estimate of how much ram I'll need to run elprep filter based on the size of the input bam.

Do you have a way of calculating this, e.g. when submitting a job to a cloud provider?

Thanks
Matthias

@caherzee
Contributor

Hi Matthias,

For elPrep 4, we made a predictor for peak RAM use based on a set of benchmark runs. More specifically, we made such a predictor for WGS data in the elPrep filter mode. This gave us the following equation for predicting peak RAM use (Y, in GB) from the input BAM size (X, in GB): Y = 15X + 32.

This means elPrep 4 requires about 32 GB of base memory plus 15 times the input BAM size (in GB) for the filter mode on WGS data. To estimate the memory use of the sfm mode, you would need to look at the BAM size of the largest split file, which varies between data sets.
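That rule of thumb is easy to script when sizing cloud instances. A minimal sketch (the function name is ours, not part of elPrep; the coefficients are the benchmark-derived fit above and only apply to elPrep 4 filter mode on WGS data):

```python
def estimate_peak_ram_gb(input_bam_gb: float) -> float:
    """Estimate peak RAM (GB) for elPrep 4 filter mode on WGS data.

    Based on the benchmark-derived predictor Y = 15X + 32, where X is
    the input BAM size in GB. A rough estimate, not a guarantee.
    """
    return 15 * input_bam_gb + 32

# e.g. a 10 GB WGS BAM -> about 182 GB peak RAM
print(f"{estimate_peak_ram_gb(10):.0f} GB")
```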

The numbers would look a bit different for WES data. We would also need to update the predictor for elPrep 5.

Does this help? Would it be useful if we built a predictor specific to your use case?

Thanks!
Charlotte

@matthdsm
Contributor Author

matthdsm commented Mar 11, 2021

Hi Charlotte,

Thanks! I ran the numbers and we're getting somewhat different results. For an exome of about 8 GB we see a RAM usage of about 300 GB on average (3 tests, with 20, 40, and 80 threads). Anecdotally, the more threads we used, the lower the RAM usage (about 30 GB difference between 20 and 80 threads).

An updated predictor would be most welcome!

cheers
Matthias

NB, the command used was

```shell
elprep filter \
    $1 \
    ${1%-sort.bam}.bam \
    --nr-of-threads 20 \
    --mark-duplicates \
    --mark-optical-duplicates ${1%-sort.bam}_duplicate_metrics.txt \
    --optical-duplicates-pixel-distance 2500 \
    --sorting-order coordinate \
    --haplotypecaller ${1%-sort.bam}.vcf.gz \
    --reference /references/Hsapiens/hg38/seq/hg38.elfasta \
    --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed \
    --log-path $PWD --timed
```
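(For anyone reading along: `${1%-sort.bam}` is plain bash suffix stripping on the script's first argument; the filename below is just an example:)

```shell
# ${var%pattern} removes the shortest matching suffix from $var
f="sample-sort.bam"
echo "${f%-sort.bam}.bam"                    # -> sample.bam
echo "${f%-sort.bam}_duplicate_metrics.txt"  # -> sample_duplicate_metrics.txt
```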

@caherzee
Contributor

Hi Matthias,

I have made a preliminary predictor for elPrep 5 based on benchmarks for data samples we have at our lab: Y = 24X + 3. This, however, is quite far from the numbers you saw in your runs.

I have a couple of questions:

  • Would it be possible to do a run with BQSR included? BQSR smooths the quality scores and we have seen that removing the option can have an impact on the computational performance of the haplotype caller step, possibly increasing the RAM use.
  • Did you compile the elPrep binary yourself? If so, which version of the Go compiler was used? Or did you download the binary from our website? If so, which version did you test?
  • Would it be possible to send us the log files of the elprep runs?
  • Would it be possible for us to get access to your data sample so that we can do some tests ourselves?

Thanks a lot!

Best,
Charlotte

@matthdsm
Contributor Author

Hi Charlotte,

  • I'll rerun my testcase with BQSR enabled and keep you posted.
  • We're using the precompiled binary from the website, version 5.0.1
  • I'll e-mail you the logs
  • I'll see what I can do about the data. Perhaps I can retry using a GiaB sample, which is better for sharing.

I'll keep you posted!

Matthias

@matthdsm
Contributor Author

matthdsm commented Mar 14, 2021

A quick test with BQSR on 80 threads reduces RAM usage by about 20 GB (270 GB total), so you were right about it affecting the requirements!
Matthias

edit: removed off topic remark

@pcostanza
Contributor

@matthdsm I opened new issues for your two side notes. I hope you were notified of my answers there.

Thanks,
Pascal

@matthdsm
Contributor Author

Duly noted!

@matthdsm
Contributor Author

matthdsm commented Jul 8, 2021

Hi Charlotte,

Are there any updates regarding the RAM usage estimate?

Thanks
M

@caherzee
Contributor

caherzee commented Jul 8, 2021

Hi Matthias,

Our last e-mail exchange, about getting access to a data file, was on March 17th. As far as I know, we never received a reply?

Thanks!
Charlotte

@matthdsm
Contributor Author

matthdsm commented Jul 8, 2021

Right, I lost track of what had already been done. Let me get back to you!

M

@matthdsm
Contributor Author

Hi Charlotte,

To get back to this: which compression level do you use for your test input data? That might be why your formula doesn't hold for our data. Since the input BAM is intermediate data, we only use fast compression (e.g. samtools view -1) to save time, which results in a larger BAM file.

On a related note, what compression level do you use for the output BAMs? I noticed the output BAM is larger than the input, which usually isn't the case when the data is sorted.
M

@caherzee
Copy link
Contributor

caherzee commented Oct 22, 2021 via email

@matthdsm
Contributor Author

I ran some tests on our infrastructure and came up with Y = 34X + 20.
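Putting the three empirical fits from this thread side by side makes the spread visible. A sketch (these are rough benchmark fits from different data sets and elPrep versions, not official numbers; the labels are ours):

```python
# Peak-RAM predictors from this thread: Y (GB) from X = input BAM size (GB).
PREDICTORS = {
    "elPrep 4, WGS (caherzee)":      lambda x: 15 * x + 32,
    "elPrep 5, lab data (caherzee)": lambda x: 24 * x + 3,
    "elPrep 5, our WES (matthdsm)":  lambda x: 34 * x + 20,
}

bam_gb = 8  # e.g. the ~8 GB exome discussed above
for name, predict in PREDICTORS.items():
    print(f"{name}: ~{predict(bam_gb):.0f} GB")
# -> 152, 195, and 292 GB respectively for an 8 GB BAM
```

For the 8 GB exome, only the last fit comes close to the ~270-300 GB actually observed, which is consistent with compression level and data type shifting the coefficients.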
