Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions for GERP from bam files directly #85

Open
willright28 opened this issue Sep 27, 2024 · 14 comments
Open

Questions for GERP from bam files directly #85

willright28 opened this issue Sep 27, 2024 · 14 comments
Assignees
Labels
question Further information is requested

Comments

@willright28
Copy link

Hi @viklund and all contributors,

Thanks for this amazing workflow and I found it very useful! I was wondering if there is a way to use it more efficiently.

For example, I already have bam files for my samples, and I just want to use this workflow to get the GERP scores, how can I do this? And also I noticed the sample naming convention is very strict, requiring name+index+lane, what if only the name is availble and what should I do?

Thanks for your time!

Best,
willright28

@willright28
Copy link
Author

Another question: I noticed this workflow using docker to prepare the bioinformatics softwares, say bedtools, but docker service was blocked in my nation, can I instead prepare these softwares by myself?

@verku
Copy link
Collaborator

verku commented Oct 9, 2024

Hi willright28!

Thank you for reaching out, I'm happy to hear the pipeline is useful for you!

Yes, you can run the pipeline to only get GERP scores for the reference genome. To do that, you only add the path to your reference genome at the top of the config file, and then you scroll down to the GERP section where you set "gerp" to "True", add a path to a directory containing the genome assemblies of outgroup species in "gerp_ref_path", provide a tree (e.g. from timetree.org), and leave "minGERP" and "maxGERP" at the default values (these two parameters won't be used when only estimating GERP scores).

If you still want to process your re-sequencing data with the pipeline, you can invent an index and lane for each sample, e.g. "sampleA_01_L1", "sampleB_01_L1". The index and lane information is required for merging of BAM files and if there is only one BAM per sample, the merging is skipped.

Regarding your last question, Snakemake (the workflow manager used for this pipeline) runs singularity/apptainer and not Docker (even though Docker images are downloaded, but they are run in singularity/apptainer). Docker cannot be run on HPC clusters (for which GenErode was developed) while singularity/apptainer can, so I hope you have access to a system with singularity/apptainer so that you'll be able to run the pipeline. I would not recommend to try to install all the softwares yourself - you would have to go into the source code and change every single Snakemake rule to use existing software to do that and it's quite a long pipeline with many rules.

I hope this answers your questions, but don't hesitate to reach out again if anything is still unclear or if you have any additional questions!

Best,
Verena

@verku verku self-assigned this Oct 9, 2024
@verku verku added the question Further information is requested label Oct 9, 2024
@willright28
Copy link
Author

Hi Verena!

Thank you very much for your answer! It has solved some of my questions, but now I'm encountering new ones!

In short, by following your guidance, I've prepared the config file and renamed my re-sequencing data, and the dry run has passed without error message:
snakemake --cores 1 --use-singularity -np &> dry_run_test.out
But when it moved to the main run, I think the network problem might have prevented the pipeline from working well. My code is:
snakemake --cores 40 -j 100 --use-singularity &> main_run_gerp.out
and one of the error message is:
Pulling singularity image docker://biocontainers/samtools:v1.9-4-deb_cv1. WorkflowError: Failed to pull singularity image from docker://biocontainers/samtools:v1.9-4-deb_cv1: WARNING: Couldn't use cached digest for registry: Get "https://index.docker.io/v2/": dial tcp 31.13.91.33:443: i/o timeout WARNING: Falling back to direct digest.
Now I'm trying to pull and save docker images on my PC and then build them on the HPC cluster, a little complicated, though.

Please give suggestions if work can be simplified!

Thanks for your time and attention!

Best,
willright28

@verku
Copy link
Collaborator

verku commented Oct 10, 2024

Hi!

Great to hear that you got the dry run to work! Regarding your issues with the container image you should not use Docker or your local PC since it is most likely a Singularity issue on your HPC cluster. Have you tried to pull the image on the HPC, like this?

singularity pull docker://biocontainers/samtools:v1.9-4-deb_cv1

If successful, you should find a *.sif file in the directory where you ran the command with the Singularity image (created from the Docker image).

Best,
Verena

@willright28
Copy link
Author

Hi Verena!

Yes I have tried to pull the image, but message said:

singularity pull docker://biocontainers/samtools:v1.9-4-deb_cv1 WARNING: Couldn't use cached digest for registry: Get "https://index.docker.io/v2/": dial tcp 108.160.169.37:443: i/o timeout WARNING: Falling back to direct digest. FATAL: While making image from oci registry: error fetching image to cache: failed to get checksum for docker://biocontainers/samtools:v1.9-4-deb_cv1: Get "https://index.docker.io/v2/": dial tcp 31.13.96.193:443: i/o timeout

I'm afraid it's the network issue😂, instead the "oras://community.wave.seqera.io/library/seqtk:1.4--e75a8dec899d1be8" can be pulled.

Thanks very much!

Best,
willright28

@willright28
Copy link
Author

Update: "docker://quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0" can be pulled and I can find the .sif file lol.

So I cann't tell where is the problem...

@verku
Copy link
Collaborator

verku commented Oct 10, 2024

Hi!

Since other containers work, it sounds to me as if the container itself is causing a problem on your system! Samtools should be updated anyway to a new version, so if you can wait for a few weeks or so, I can update the container for the pipeline when I publish the next version of GenErode.

Otherwise, if you want to try to change the container yourself in the GenErode pipeline source code, you can try to replace the string "docker://biocontainers/samtools:v1.9-4-deb_cv1" with "https://depot.galaxyproject.org/singularity/samtools:1.20--h50ea8bc_1". It should be present in several of the rules files ending with *.smk in this folder of the pipeline source code: workflow/rules/.

I haven't had the time to test this container myself yet but I'm planning to use it for the next pipeline version and will run a few tests when I prepare the update.

@willright28
Copy link
Author

Hi!

Thanks for the guidance! I will manage to do it.

But for now, a bad news is "docker://nbisweden/generode-bwa:latest" failed 😂. I don't know if it works on your system.

Best,
willright28

@verku
Copy link
Collaborator

verku commented Oct 14, 2024

Hi!

This is our own container that worked on our old HPC cluster. We are currently moving to a new system but so far I haven't heard that anyone had problems with it. However, since there are public container images with the same software available, and we're planning to upgrade samtools (which is also installed in this container), I think it makes sense to replace the container in the next GenErode pipeline version.

I might use seqera to get a container with bwa and samtools. This is the website where you can combine different tools into a new container: https://seqera.io/containers/ And here is a link to a container image that I created with bwa 0.7.17 and samtools 1.20 that you can try: oras://community.wave.seqera.io/library/bwa_samtools:7a05dc7f6bd0f339

If this doesn't work, there is a way to find combinations of tools in containers via biocontainers.pro.

All the best,
Verena

@willright28
Copy link
Author

Hi!

Sorry for the late reply. I'm still not able to access to the docker container and use your pipeline. But the good new is I've tried to follow the GERP calculation steps your provided to get the GERP scores for my samples. Thanks for those detailed instructions!

Now I have a small question about the script usage. I noticed that some studies (von Seth, 2021, Sumatran rhinoceros) used a closely related species (white rhinoceros) as the reference to infer ancestral alleles and get GERP score, and you also mentioned this point in your article. I would like to ask where should I change the REF_NAME from study species (Sumatran rhinoceros) to its relative (white rhinoceros)?

I see three main steps to infer ancestral alleles and compute GERP score, and I think these are the key to avoid potential biases:

  1. Compute GERP score: gerpcol -t XX.tree -f XX.fa -e white rhinoceros XXX. I can ensure that I should point the relative in this step and this step will produce XX.fasta.rates.
  2. Convert GERP-scores to the correct genomic coordinates: gerp_to_position.py XX.fa XX.fasta.rates X_species. From this step, should I also use the white rhinoceros or the Sumatran rhinoceros?
  3. Get ancestral state: fasta_to_major_allele.py XX.fa X_species. And also this step.

Thanks very much!

Best,
willright28

@verku
Copy link
Collaborator

verku commented Nov 8, 2024

Hi!

Sorry to hear that the containers still don't work for you! I have an idea regarding our custom containers and will let you know if I can replace or update them.

Yes, it is recommended to use a closely related species to calculate mutational load from GERP scores. If you have access to the reference from a closely related species, you need to replace the Sumatran rhino reference genome (i.e. your target species) with the reference genome from White rhino (i.e. the closely related species) in all GERP-related steps of the analysis. So if you want to calculate mutational load from these GERP scores, you also need to map your re-sequencing data (from your target species) to the reference genome of the closely related species to do variant calling.

Best,
Verena

@willright28
Copy link
Author

Hi!

Got your point. I'll try to get the GERP score. And waiting for your good news!

Best,
willright28

@verku
Copy link
Collaborator

verku commented Nov 27, 2024

Hi @willright28!

A new GenErode version is available (0.6.2)! In this version I have exchanged all our custom containers but one for publicly available containers (e.g. from sequera) so I hope they will finally work on your system. The container that remains is the one for mlRho for which I was not able to find any public container with a helper tool that the pipeline needs.

Importantly, the new GenErode version contains some bug fixes to the GERP part of the pipeline. In the original code, sites were missing in the alignments that are now included, so there will most likely be more sites with GERP scores in the output from version 0.6.2 compared to that from earlier GenErode versions.

Let me know if you still run into any issues!

@willright28
Copy link
Author

willright28 commented Nov 28, 2024

Hi!

Thank you for letting me know about the software update, what an exciting news! I appreciate your efforts in updating GenErode! Can't wait to test it and will let you know if I encounter any issues.

Best,
willright28

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants