Session 3 - Identification of Coding Sequences

Goals of the practical session

The goal of this session is to be able to locate the functional regions of a given genome for which we don't have any previous annotation.

Assessment of the practical session:

Work alone or in pairs during this lab. As you progress, take notes of answers to the questions highlighted on this protocol on a text document or on a paper sheet. At the end of the lab, you should approach one of the teaching assistants present, and ask them to check your answers. They will discuss incorrect answers with you. Once the assistants have approved your answers, they will mark your attendance and the practical session is considered completed. If there is not time for you to present your answers, you can submit them via Studium.

The problem:

For this session and the next, we are going to work with samples retrieved from real patients in the context of an Escherichia coli outbreak. The files we are going to work with contain the complete sequence of a plasmid. Plasmids are small sequences of DNA present in some bacteria outside the circular chromosome. Bacteria can share them via horizontal gene transfer in a process called "conjugation" and this plasmids have a huge influence on virulence, pathogenicity and antibiotic resistance. With that in mind, we want to fully characterize these sequences, in order to inform the doctors of how to properly combat this new outbreak of EHEC.

Steps of the bacterial conjugation

Choose two of the available sequences in DATA/Lab3/ and copy them into your personal folder.

Gene identification tools

Gene finding is a central aspect of bioinformatics, and a key issue when trying to make sense out of new and/or unknown sequence such as our plasmid. There are several different ways to do this and what is best will depend on what genome is being analyzed, especially whether it is prokaryotic or eukaryotic. Methods range from finding stretches without stop codons, via homology comparisons with other, closely related sequences, to highly sophisticated neural networks. Gene finding in eukaryotes is particularly difficult and typically requires cross-validation between several complex methods. In prokaryotes, though, it is quite more straightforward. In most cases a simple search for open reading frames (ORFs) is usually sufficient to get a good set of candidate genes for further analysis.

Simple ORF search

The most simple and naïve way to search for genes is to search your input sequence for the initiation codon ATG followed by a stretch of DNA free from an in-frame stop codon. Since almost all gene products are larger than 50 amino acids, it is reasonable to set the minimum length of your ORFs to 150 nucleotides.

You will deal with the program getorf in the EMBOSS suite (EMBOSS suite.pdf) In order to use getorf with a minimum ORF size of 150 nucleotides and a search from ATG to stop, type

getorf YOUR_FILE.fna -minsize 150 -find 3

on the command line. Genes might be located on the opposite strand to the sequenced one, but this is not a problem since getorf will still find it and note the direction of the gene in its output. Note that STOP codons are not included in the output. That issue, however, will be fixed by the script that parses program output to a format that can be used in the visualization tool Artemis:

perl ./SRC/GenePrediction2Artemis.pl -i getorf.file -o outputname.artemis -getorf

Now, run getorf on your plasmid. Save the result.

A more sophisticated method

There are of course a number of ways to increase the precision of the simple ORF search. You could analyze G+C content, alternate start codons, base composition biases and much more. There is a software suite called GeneMark, available both online and as standalone binaries, that does just that. There are a number of ways to use GeneMark to predict ORFs. They all have in common that they need to be trained however. The simplest program is called just GeneMark (gm on the command line). And it uses an extended ORF search algorithm to find ORFs.

Hidden Markov model

The extended ORF search is pretty good, but still it is based on looking in the sequence for known features. Sometimes you want to find a gene that is a pseudo-gene or in any other way lacks features like start and stop codons ordinarily needed to identify a gene. In this case it is very useful to utilize methods that are trained on a set of known genes to achieve a pattern of how the bases of a typical gene look like. Thereafter it should try to find other stretches of the sequence that has the same parameter values as a typical gene. The program GeneMark.hmm does just that. But it of course requires a model that has been generated by some training on known genes in the genome.

Training

What if you don't have a training set but still thinks the HMM approach would be good? Would it still be possible to do that? Sure! By starting with an ab initio method such as getorf you can get the starting point necessary to use trained models. There is a third variant of GeneMark called GeneMarkS that you can use for training. The nice thing about this program is that you can train it on an entire genome in which you have not predicted any ORFs yet. Lucky for us, GeneMark2, the newest version, have all this steps as a pipeline, so we can run it all in one command.

Example usage of the GeneMark suite

The first thing we need to do is untar (decompress) the folder in which the program is: `tar -xvf ./SRC/Lab3/gms2_linux_64.tar.gz

gunzip ./SRC/Lab3/gm_key_64.gz`

If you have your new plasmid sequence in the file plasmid.fna and want to predict some ORFs in it, you can do it like this:

Generate the HMM by training on the genome with gmsn.pl. This will generate a new file called GeneMark_hmm.mod. In order for this command to work, you need to execute this command:

cp ./SRC/gm_key_64 ~/.gmhmmp2_key
Now we are ready to search for orfs in the genome:
perl ./SRC/gms2_linux_64/gms2.pl --seq YOUR_FILE.fna --genome-type bacteria --output SOME_OUTPUT_NAME.lst

**Exercise: run the Genemark suite on the provided genome. Save the result and run the conversion script we also used with getorf to change the format. **

Visualization

In order to visualize the differences between putative ORFs, there is a tool called Artemis that may be of use. If you have JAVA installed, you could use a web based launch from the same place. In order to convert the outputs from getorf and GeneMarkS to an Artemis format you need a script that can parse the program output into a format recognized by Artemis. In Artemis you can load in the sequence file and then on top of it display different feature files, in the following format (NB The number of spaces are important):

FT CDS 1000..1500   
FT /color='5'   
FT CDS complement(11000..10000)   
FT /color='6'

which tells Artemis to color bases between 1000 and 1500 cyan (5) and bases between 10000 and 11000 on the complementary strand magenta (6).

To start Artemis on Linux, type art

In Windows or MacOS, do as prompted on the Artemis web page.

In Artemis, click "Open ...", then choose your genome file (usually the *.fna). In the new window, click "File", "Read an Entry ..." and then choose "YourArtemisFile" to see that gene search in the genome. Repeat this process for all the gene searches you want to compare. If you want to see them on separate lines, click "Display" and check the box in front of "Show One Line Per Entry View" .

Display options

There are three basic sections in the main window of Artemis. The first section is an overview of the sequence and features. The middle grey lines represent the + and the - strand. Above and below the strands are three lines, representing the six codon frames(+1,+2,+3 above; -1,-2,-3 below); this is where you see the coding features (since they are necessarily in one of the codon frames). The second section is a close-in look at the sequence, with the amino acid codes in all six frames. The third section is a list of all features. Under the Display menu, you can hide/show the lower sections (same as clicking the small double-arrows at the top left of each section).

Fig.2 - Artemis interface.

Each black horizontal line in the frame lines of the overview section represents a potential stop codon. This information can be useful for finding coding genes, since coding genes are found in long stretches without stop codons in a particular frame. If you want to hide the stop codons, right-click in the overview section and uncheck the box "Stop Codons".

Selecting features and regions

If you click a feature, this feature is selected. Selected feature are indicated by bold black borders. You can select more than one feature by keeping SHIFT down and clicking one feature after the other. To select many features, you might want to use the menus. For example, to select all CDS (coding genes) features, choose click "Select->All CDS Features".

If you want to select a specific region (and not a particular feature), just click and drag your mouse on either the + or the - strand. This will highlight the selected region.

Exercise : open the provided genome and the gene predictions and answer the questions.

Viewing bug

If you want to view the DNA sequence of a gene, do not use the visible sequence line that Artemis shows. This is bugged! In order to correctly view the DNA sequence of the gene, right click the CDS and then choose "View->Bases of Selection".

If part of the display disappeared and the features seem to move, which is rather inconvenient for a viewing program(!), the way to check gene sequence is the following. Right click a feature, then choose "View->Bases of Selection".

Questions

Use the naive ORF finding tool (getorf) and the combined approach tool (GeneMarkS) as outlined above.

Use your knowledge about the strengths and weaknesses of the methods to answer the following questions.

Question 1

How many putative genes do each method find?

Question 2

Look at the features (here a feature is a putative gene ) . Did both methods found the same genes? Is there any difference? Choose 3 examples and explain why you believe the predictions of both programs differ.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lab_3.md

Lab_3.md

Session 3 - Identification of Coding Sequences

Goals of the practical session

Assessment of the practical session:

The problem:

Gene identification tools

Simple ORF search

A more sophisticated method

Hidden Markov model

Training

Example usage of the GeneMark suite

Visualization

Display options

Selecting features and regions

Viewing bug

Questions

Question 1

Question 2

Files

Lab_3.md

Latest commit

History

Lab_3.md

File metadata and controls

Session 3 - Identification of Coding Sequences

Goals of the practical session

Assessment of the practical session:

The problem:

Gene identification tools

Simple ORF search

A more sophisticated method

Hidden Markov model

Training

Example usage of the GeneMark suite

Visualization

Display options

Selecting features and regions

Viewing bug

Questions

Question 1

Question 2