-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Questions for GERP from bam files directly #85
Comments
Another question: I noticed this workflow using docker to prepare the bioinformatics softwares, say bedtools, but docker service was blocked in my nation, can I instead prepare these softwares by myself? |
Hi willright28! Thank you for reaching out, I'm happy to hear the pipeline is useful for you! Yes, you can run the pipeline to only get GERP scores for the reference genome. To do that, you only add the path to your reference genome at the top of the config file, and then you scroll down to the GERP section where you set "gerp" to "True", add a path to a directory containing the genome assemblies of outgroup species in "gerp_ref_path", provide a tree (e.g. from timetree.org), and leave "minGERP" and "maxGERP" at the default values (these two parameters won't be used when only estimating GERP scores). If you still want to process your re-sequencing data with the pipeline, you can invent an index and lane for each sample, e.g. "sampleA_01_L1", "sampleB_01_L1". The index and lane information is required for merging of BAM files and if there is only one BAM per sample, the merging is skipped. Regarding your last question, Snakemake (the workflow manager used for this pipeline) runs singularity/apptainer and not Docker (even though Docker images are downloaded, but they are run in singularity/apptainer). Docker cannot be run on HPC clusters (for which GenErode was developed) while singularity/apptainer can, so I hope you have access to a system with singularity/apptainer so that you'll be able to run the pipeline. I would not recommend to try to install all the softwares yourself - you would have to go into the source code and change every single Snakemake rule to use existing software to do that and it's quite a long pipeline with many rules. I hope this answers your questions, but don't hesitate to reach out again if anything is still unclear or if you have any additional questions! Best, |
Hi Verena! Thank you very much for your answer! It has solved some of my questions, but now I'm encountering new ones! In short, by following your guidance, I've prepared the config file and renamed my re-sequencing data, and the dry run has passed without error message: Please give suggestions if work can be simplified! Thanks for your time and attention! Best, |
Hi! Great to hear that you got the dry run to work! Regarding your issues with the container image you should not use Docker or your local PC since it is most likely a Singularity issue on your HPC cluster. Have you tried to pull the image on the HPC, like this?
If successful, you should find a *.sif file in the directory where you ran the command with the Singularity image (created from the Docker image). Best, |
Hi Verena! Yes I have tried to pull the image, but message said:
I'm afraid it's the network issue😂, instead the "oras://community.wave.seqera.io/library/seqtk:1.4--e75a8dec899d1be8" can be pulled. Thanks very much! Best, |
Update: "docker://quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0" can be pulled and I can find the .sif file lol. So I cann't tell where is the problem... |
Hi! Since other containers work, it sounds to me as if the container itself is causing a problem on your system! Samtools should be updated anyway to a new version, so if you can wait for a few weeks or so, I can update the container for the pipeline when I publish the next version of GenErode. Otherwise, if you want to try to change the container yourself in the GenErode pipeline source code, you can try to replace the string "docker://biocontainers/samtools:v1.9-4-deb_cv1" with "https://depot.galaxyproject.org/singularity/samtools:1.20--h50ea8bc_1". It should be present in several of the rules files ending with I haven't had the time to test this container myself yet but I'm planning to use it for the next pipeline version and will run a few tests when I prepare the update. |
Hi! Thanks for the guidance! I will manage to do it. But for now, a bad news is "docker://nbisweden/generode-bwa:latest" failed 😂. I don't know if it works on your system. Best, |
Hi! This is our own container that worked on our old HPC cluster. We are currently moving to a new system but so far I haven't heard that anyone had problems with it. However, since there are public container images with the same software available, and we're planning to upgrade samtools (which is also installed in this container), I think it makes sense to replace the container in the next GenErode pipeline version. I might use seqera to get a container with bwa and samtools. This is the website where you can combine different tools into a new container: https://seqera.io/containers/ And here is a link to a container image that I created with bwa 0.7.17 and samtools 1.20 that you can try: oras://community.wave.seqera.io/library/bwa_samtools:7a05dc7f6bd0f339 If this doesn't work, there is a way to find combinations of tools in containers via biocontainers.pro. All the best, |
Hi! Sorry for the late reply. I'm still not able to access to the docker container and use your pipeline. But the good new is I've tried to follow the GERP calculation steps your provided to get the GERP scores for my samples. Thanks for those detailed instructions! Now I have a small question about the script usage. I noticed that some studies (von Seth, 2021, Sumatran rhinoceros) used a closely related species (white rhinoceros) as the reference to infer ancestral alleles and get GERP score, and you also mentioned this point in your article. I would like to ask where should I change the REF_NAME from study species (Sumatran rhinoceros) to its relative (white rhinoceros)? I see three main steps to infer ancestral alleles and compute GERP score, and I think these are the key to avoid potential biases:
Thanks very much! Best, |
Hi! Sorry to hear that the containers still don't work for you! I have an idea regarding our custom containers and will let you know if I can replace or update them. Yes, it is recommended to use a closely related species to calculate mutational load from GERP scores. If you have access to the reference from a closely related species, you need to replace the Sumatran rhino reference genome (i.e. your target species) with the reference genome from White rhino (i.e. the closely related species) in all GERP-related steps of the analysis. So if you want to calculate mutational load from these GERP scores, you also need to map your re-sequencing data (from your target species) to the reference genome of the closely related species to do variant calling. Best, |
Hi! Got your point. I'll try to get the GERP score. And waiting for your good news! Best, |
Hi @willright28! A new GenErode version is available (0.6.2)! In this version I have exchanged all our custom containers but one for publicly available containers (e.g. from sequera) so I hope they will finally work on your system. The container that remains is the one for mlRho for which I was not able to find any public container with a helper tool that the pipeline needs. Importantly, the new GenErode version contains some bug fixes to the GERP part of the pipeline. In the original code, sites were missing in the alignments that are now included, so there will most likely be more sites with GERP scores in the output from version 0.6.2 compared to that from earlier GenErode versions. Let me know if you still run into any issues! |
Hi! Thank you for letting me know about the software update, what an exciting news! I appreciate your efforts in updating GenErode! Can't wait to test it and will let you know if I encounter any issues. Best, |
Hi @viklund and all contributors,
Thanks for this amazing workflow and I found it very useful! I was wondering if there is a way to use it more efficiently.
For example, I already have bam files for my samples, and I just want to use this workflow to get the GERP scores, how can I do this? And also I noticed the sample naming convention is very strict, requiring name+index+lane, what if only the name is availble and what should I do?
Thanks for your time!
Best,
willright28
The text was updated successfully, but these errors were encountered: