Since the eve of Biology as a field of Science, one of the key questions has been how do the different organisms we see today relate to each other, and how they evolved. In the old days, we used anatomical similarities (also known as homologies) and dissimilarities to try to reconstruct their evolutionary history.
With the dawn of genetic sequencing and the genomic era, we can now establish those relationships with quite more certainty and in a less biased way. Using genetic data for inferring thees relationships is known as Phylogenetics, and it is the "gold-standard" to stablish the relationship between modern species.
The basic idea behind it all is quite simple: as species diverge over time, they accumulate mutations that the other groups don't share. So, when comparing several sequences, the bigger the number of differences between them, the larger the time since their common ancestor. However, this simple idea gets complicated quite soon, as we are working with really long sequences and, in some cases, long periods of time. This means that we will need to use robust statistical modeling in order to infer these relationships, that we will represent as a phylogenetic tree.
Phylogenetic tree from Ersmark et al. 2016: https://doi.org/10.3389/fevo.2016.00134
Each tree is a hypothesis of the relationship between our sequences, and our goal is to identify, from all the possible trees, the one that is most likely to be true according to our data. This may vary depending on the region you are looking at, the models that you use or how you preprocess and align your sequences.
So, with this in our minds, let's get going.
- Align our sequences from Session 6
- Test which substitution model works better with our data
- Work with IQTree and learn how to extract information from its output
- Create a phylogenetic tree meaningful for our project's question
- Fasta sequences of the complete mitochondrial DNA and CytB that we curated on Lab 6
- Multiple Alignment of the complete mitochondria
- Multiple Alignment of the CytB
- IQTree file with relevant info on out tree
- Tree file
For this Session, we are going to use the files that we created in the previous one. Make sure you followed the instructions properly and that you have all the files located.
It's also a good idea to use the script you created in session 6 to change to one of the shorter names in your fastafile!
The first step is to do a Multiple Alignment. As our mitochondrial sequences are considerably bigger than the single gene ones, we will submit them as an independent job to Rackham, as we did in Session 5. However, today we are going to use the other method to submit a job using sbatch
:
a script.
It should look like this:
#!/bin/bash -l
#SBATCH -A uppmax2022-2-2
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 45:00
#SBATCH -J align_mt
#SBATCH -o align_mt.output
#SBATCH -e align_mt.output
#SBATCH --mail-user youremailforUppmax
#SBATCH --mail-type=END,FAIL
module load bioinfo-tools MAFFT/7.407
YOUR MAFFT COMMAND
exit 0
Open nano (or another text editor on Uppmax) and paste in the text above (with the correct changes of course) then save it to a good filename, such as run_mafft.sh
.
.sh
is often use for shell scripts, such as sbatch.
Now, look at the sbatch
bits. Look familiar? Yes, they are the same (almost) parameters that we used in Session 5, but this time we are imputing them inside the script, instead of specifying them when we call sbatch
. Remember to change youremailforUppmax and YOUR MAFFT COMMAND for your own email and command, respectively.
#SBATCH -o align_mt.output
#SBATCH -e align_mt.output
Will tell slurm to take anything that would normally be printed to the screen and save it to the file align_mt.output
.
IF you want to reuse this script later it might be a good idea to change this variable.
Now, in order to submit the job, just use this code:
sbatch -M snowy SCRIPT_NAME
If you want to check that your job has been submitted and in which state it is, use this command:
jobinfo -M snowy -u YOUR_UPPMAX_USER
Comment: if the job has finished, you won't see anything. In that case, check your emails!
Now open an interactive session and run the same mafft
command with your CytB sequences. You should thus not submit this as an sbatch
-job as it will proably take less than a minute to run.
Once we have the alignment, we can proceed to inferring which of all the possible trees is the most likely. We have several methods to do this:
- Parsimony: "the simplest explanation that can explain the data is to be preferred", so the hypothesis with the smallest number of changes is the most likely. However, this method has plenty of assumptions that we know are false, so it is not used anymore.
- Neighbour-joining: A slightly more refined version of parsimony in which we chose the best tree by minimizing branch lengths in the tree. More computationally intensive than parsimony, but still something that a modern computer can do fairly quickly.
- Maximum Likelihood: "Likelihood is defined to be a quantity proportional to the probability of observing the data given the model". This means that, by providing a model of how DNA sequences change, we can determine which tree is the most probable to be true.
- Bayesian Inference: This method uses Bayesian Statistics to combine prior information that we know about our data (also known as Prior Probability Distribution) with the likelihood, in order to transform it into a more accurate probability distribution, known as the Posterior.
Prior, likelihood and posterior distribution for a two-parameter phylogenetic example in Nascimento et al. 2017: https://dx.doi.org/10.1038%2Fs41559-017-0280-x
The last two are the state-of-the-art methods for phylogenetic analysis, and have become more and more popular as computing power has increased, as both methods are very demanding in that regard.
For our project, we are going to use an implementation of the Maximum Likelihood approach called IQ-TREE. This software offers several methods to speed up the analysis. To load the module:
module load iqtree/1.6.5-omp-mpi
As we mentioned earlier, any Maximum Likelihood approach is based on a model. In phylogenetics, this model describes the probability of each substitution to happen. Here you can find a list of the more common models, and here the ones that are implemented in IQ-TREE.
Graphical representation of some substitution models from Yang & Rannala, 2012. Nature Reviews Genetics: https://doi.org/10.1038/nrg3186
Now that we have a small picture of what we are doing, lets start working with IQ-TREE. The basic syntax for this software is:
iqtree -s ALIGNMENT -o OUTGROUP -m MODEL -pre OUTPUT_PREFIX -bb 1000
Under OUTGROUP you should put the name of your outrgroup as they appear in the alignment file. If you have multiple outgroups you can separate them with a comma (no spaces!) eg;
-o c_Vurs,H_sap
Now run IQ-TREE in your interactive session with the CytB data, and set your model to -m MFP. MFP stands for ModelFinder Plus, and is an algorithm that automatically considers a list of substitution models and estimates which is the one that fits our data better. -bb 1000 means that we want our algorithm to use bootstrapping. Once the alignment of the whole mitochondrial dataset is done, run IQ-TREE on that dataset via sbatch
. Remember to adapt the script above to run IQ-TREE (keep the module bioinfo-tools) and be careful to not over-write your files.
All the questions below refer only to the CytB output.
Question 1: Which files do IQ-TREE output? Explain briefly what each of them is.
Now let's look at the .iqtree file.
Question 2: Which model did ModelFinder choose? From all the criteria calculated by this software, which was used to determine the best-fitting model?
Question 3: Briefly explain the best-fitting model.
Question 4: Now look at both your Maximum Likelihood tree and Consensus Tree. Are they the same? If not, where do they differ?
Question 5: In both trees you can see a number at the base of each branch. That is the number of iterations that supported that branching during bootstrapping. Which is your least supported branch? What does that mean to your question?
Submit a file with the answers to all the questions and the .iqtree file for the CytB run.
This is the end of the lab, please make sure that you completed and wrote down the answers to all of the questions.
Upload the scripts (code) that you were asked to submit to studium in the original format (i.e. .py or .sh), no pdf
or word files! Any answers that are not code should of course be in text formats such as .pdf, .txt & .docx
.
Also, make sure to delete any files that you no longer need - you can copy them somewhere else if you want to keep them. This goes for both the Unix computers and Uppmax.