The goal of this session is to be able to locate the functional regions of a given genome for which we don't have any previous annotation.
Work alone or in pairs during this lab. As you progress, take notes of answers to the questions highlighted on this protocol on a text document or on a paper sheet. At the end of the lab, you should approach one of the teaching assistants present, and ask them to check your answers. They will discuss incorrect answers with you. Once the assistants have approved your answers, they will mark your attendance and the practical session is considered completed. If there is not time for you to present your answers, you can submit them via Studium.
For this session and the next, we are going to work with samples retrieved from real patients in the context of an Escherichia coli outbreak. The files we are going to work with contain the complete sequence of a plasmid. Plasmids are small sequences of DNA present in some bacteria outside the circular chromosome. Bacteria can share them via horizontal gene transfer in a process called "conjugation" and this plasmids have a huge influence on virulence, pathogenicity and antibiotic resistance. With that in mind, we want to fully characterize these sequences, in order to inform the doctors of how to properly combat this new outbreak of EHEC.
Steps of the bacterial conjugationChoose two of the available sequences in DATA/Lab3/
and copy them into your personal folder.
Gene finding is a central aspect of bioinformatics, and a key issue when trying to make sense out of new and/or unknown sequence such as our plasmid. There are several different ways to do this and what is best will depend on what genome is being analyzed, especially whether it is prokaryotic or eukaryotic. Methods range from finding stretches without stop codons, via homology comparisons with other, closely related sequences, to highly sophisticated neural networks. Gene finding in eukaryotes is particularly difficult and typically requires cross-validation between several complex methods. In prokaryotes, though, it is quite more straightforward. In most cases a simple search for open reading frames (ORFs) is usually sufficient to get a good set of candidate genes for further analysis.
The most simple and naïve way to search for genes is to search your input sequence for the initiation codon ATG followed by a stretch of DNA free from an in-frame stop codon. Since almost all gene products are larger than 50 amino acids, it is reasonable to set the minimum length of your ORFs to 150 nucleotides.
You will deal with the program getorf
in the EMBOSS suite (EMBOSS suite.pdf) In order to use getorf with a minimum ORF size of 150 nucleotides and a search from ATG to stop, type
getorf YOUR_FILE.fna -minsize 150 -find 3
on the command line. Genes might be located on the opposite strand to the sequenced one, but this is not a problem since getorf will still find it and note the direction of the gene in its output. Note that STOP codons are not included in the output. That issue, however, will be fixed by the script that parses program output to a format that can be used in the visualization tool Artemis:
perl ./SRC/GenePrediction2Artemis.pl -i getorf.file -o outputname.artemis -getorf
Now, run getorf on your plasmid. Save the result.
There are of course a number of ways to increase the precision of the simple ORF search. You could analyze G+C content, alternate start codons, base composition biases and much more. There is a software suite called GeneMark, available both online and as standalone binaries, that does just that.
There are a number of ways to use GeneMark to predict ORFs. They all have in common that they need to be trained however. The simplest program is called just GeneMark (gm
on the
command line). And it uses an extended ORF search algorithm to find ORFs.
Hidden Markov model
The extended ORF search is pretty good, but still it is based on looking in the sequence for known features. Sometimes you want to find a gene that is a pseudo-gene or in any other way lacks features like start and stop codons ordinarily needed to identify a gene. In this case it is very useful to utilize methods that are trained on a set of known genes to achieve a pattern of how the bases of a typical gene look like. Thereafter it should try to find other stretches of the sequence that has the same parameter values as a typical gene. The program GeneMark.hmm
does just that. But it of course requires a model that has been generated by some training on known genes in the genome.
What if you don't have a training set but still thinks the HMM approach would be good? Would it still
be possible to do that? Sure! By starting with an ab initio method such as getorf you can get the
starting point necessary to use trained models. There is a third variant of GeneMark
called GeneMarkS
that you can use for training. The nice thing about this program is that you can train it on an entire genome in which you have not predicted any ORFs yet. Lucky for us, GeneMark2, the newest version, have all this steps as a pipeline, so we can run it all in one command.
The first thing we need to do is untar (decompress) the folder in which the program is: `tar -xvf ./SRC/Lab3/gms2_linux_64.tar.gz
gunzip ./SRC/Lab3/gm_key_64.gz`
If you have your new plasmid sequence in the file plasmid.fna
and want to predict some ORFs in it, you can do it like this:
-
Generate the HMM by training on the genome with
gmsn.pl
. This will generate a new file calledGeneMark_hmm.mod
. In order for this command to work, you need to execute this command:cp ./SRC/gm_key_64 ~/.gmhmmp2_key
-
Now we are ready to search for orfs in the genome:
perl ./SRC/gms2_linux_64/gms2.pl --seq YOUR_FILE.fna --genome-type bacteria --output SOME_OUTPUT_NAME.lst
**Exercise: run the Genemark suite on the provided genome. Save the result and run the conversion script we also used with getorf to change the format. **
In order to visualize the differences between putative ORFs, there is a tool called Artemis
that may be of use. If you have JAVA installed, you could use a web based launch from the same place. In order to convert the outputs from getorf
and GeneMarkS
to an Artemis format you need a script that can parse the program output into a format recognized by Artemis
. In Artemis
you can load in the sequence file and then on top of it display different feature files, in the following format (NB The number of spaces are important):
FT CDS 1000..1500
FT /color='5'
FT CDS complement(11000..10000)
FT /color='6'
which tells Artemis to color bases between 1000 and 1500 cyan (5) and bases between 10000 and 11000 on the complementary strand magenta (6).
To start Artemis on Linux, type
art
In Windows or MacOS, do as prompted on the Artemis web page.
In Artemis, click "Open ...", then choose your genome file (usually the *.fna). In the new window, click "File", "Read an Entry ..." and then choose "YourArtemisFile" to see that gene search in the genome. Repeat this process for all the gene searches you want to compare. If you want to see them on separate lines, click "Display" and check the box in front of "Show One Line Per Entry View" .
There are three basic sections in the main window of Artemis
. The first section is an overview of the sequence and features. The middle grey lines represent the + and the - strand. Above and below the strands are three lines, representing the six codon frames(+1,+2,+3 above; -1,-2,-3 below); this is where you see the coding features (since they are necessarily in one of the codon frames). The second section is a close-in look at the sequence, with the amino acid codes in all six frames. The third section is a list of all features. Under the Display menu, you can hide/show the lower sections (same as clicking the small double-arrows at the top left of each section).
Each black horizontal line in the frame lines of the overview section represents a potential stop codon. This information can be useful for finding coding genes, since coding genes are found in long stretches without stop codons in a particular frame. If you want to hide the stop codons, right-click in the overview section and uncheck the box "Stop Codons".
If you click a feature, this feature is selected. Selected feature are indicated by bold black borders. You can select more than one feature by keeping SHIFT down and clicking one feature after the other. To select many features, you might want to use the menus. For example, to select all CDS (coding genes) features, choose click "Select->All CDS Features".
If you want to select a specific region (and not a particular feature), just click and drag your mouse on either the + or the - strand. This will highlight the selected region.
Exercise : open the provided genome and the gene predictions and answer the questions.
If you want to view the DNA sequence of a gene, do not use the visible sequence line that Artemis shows. This is bugged! In order to correctly view the DNA sequence of the gene, right click the CDS and then choose "View->Bases of Selection".
If part of the display disappeared and the features seem to move, which is rather inconvenient for a viewing program(!), the way to check gene sequence is the following. Right click a feature, then choose "View->Bases of Selection".
Use the naive ORF finding tool (getorf
) and the combined approach tool (GeneMarkS
) as outlined above.
Use your knowledge about the strengths and weaknesses of the methods to answer the following questions.
How many putative genes do each method find?
Look at the features (here a feature is a putative gene ) . Did both methods found the same genes? Is there any difference? Choose 3 examples and explain why you believe the predictions of both programs differ.