Skip to content

ViennaRNA/RNAz

Repository files navigation

RNAz  2.1
============

Content
=======

0. Terms of use
1. Introduction
2. Theory
   2.1. The structure conservation index
   2.2. Thermodynamic stability
   2.3. Classification based on both scores
3. Usage
   3.1. Installation
   3.2. Enviroment variable
   3.3. Invocation
   3.4. Output
4. Citing RNAz
5. Important notes
6. Contact

0. Terms of use
===============

Please read the file COPYING for licence terms of RNAz. 

1. Introduction
===============

RNAz detects stable and conserved RNA secondary structures in multiple
sequence alignments. RNAz calculates two independent scores for structural
conservation (the structure conservation index SCI) and for thermodynamical
stability (the z-score). High structural conservation (high SCI) and
thermodynamical stability (negative z-scores) are typical features of
functional RNAs (e.g. non-coding RNAs or cis-acting regulatory
elements). RNAz uses both scores to classify a given alignment as
functional RNA or not. It uses a support vector machine classification
procedure which estimates a RNA-class probability which can be used as
convenient overall-score.

For a detailed coverage of all aspects of RNAz we recommend to read the
manual/tutorial (manual.pdf).


2. Theory
=========

2.1. The structure conservation index
=====================================

RNAz uses programs from the Vienna RNA package to perform minimum free
energy (MFE) RNA secondary structure predictions.

First, it calculates the average MFEs for all single sequences in the
aligment using RNAfold.

Second the complete alignment is folded using RNAalifold. RNAalifold
implements a consensus folding algorithm which uses essentially the same
algorithms and energy parameters as RNAfold. It calculates a consensus MFE
which is composed of an energy term averaging the energy contributions of
the single sequences and a covariance term rewarding compensatory and
consistent mutations.

If the sequences in the alignment can fold into a common structure, the
average MFE and the consensus MFE will be of similar dimension. If there is
no common structure, the consensus MFE will be higher (i.e. less stable)
than the average MFE of the single sequences.

Based on this intuitive rationale, a structure conservation index is
defined:

                       consensus MFE
                 SCI=  --------------
                        average MFE


The SCI will be around 0 if RNAalifold does not find a consensus structure,
it will be around 1 if the structure is perfectly conserved. A SCI above 1
indicates a perfectly conserved secondary structure which is even supported
by compensatory and/or consistent mutations.

2.2. Thermodynamic stability
============================

The significance of a predicted MFE as calculated by RNAfold is difficult
to interpret in absolute terms. It depends on the length and the base
composition of the sequences (longer sequence => lower MFE, higher
GC-content => lower MFE). Typically the significane of a MFE is estimated
by comparing to many random sequences of the same length and base
composition. If mu is the mean and sigma the standard deviation of the MFEs
of many random sequences a convenient normalized measure for the
significance of the native sequence with MFE m is a z-score:

                          m - mu
                    z = ---------
                          sigma

RNAz can effeciently calculate z-scores without sampling. Negative z-scores
indicate that the native sequence is more stable than the random
background. The unit of z-scores are standard deviations. Random MFEs can
be roughly approximated by a standard normal distribution which gives an
impression of associated P values (e.g. z=-2 => P=0.98).



2.3. Classification based on both scores
========================================

Both scores represent independent diagnostic features of functional
RNAs. RNAz combines both into a overall score using a support vector
machine regression. Depending on the SCI and z, but also the number of
sequences in the alignment and the mean pairwise identity a RNA
class-probability is calculated. It should be noted that the level of the
SCI depends on the sequence similarity. 100% conservation for example
results in a SCI of 1 per definition but do not hold any information for
our purpose. Thus, the support vector machine was taught to interpret the
significance of the SCI depending on the sequence variation.

The confidence level of this RNA-class probability or "RNAz P-value"
slightly varies depending on the properties of the input alignment. In our
tests, P=0.5 and P=0.9 had specificities of 96% and 99%.


3. Usage
========

3.1. Installation
=================

See INSTALL for details.


3.3. Invocation
===============

RNAz [options] [filename]

RNAz takes an alignment file in the ClustalW or MAF format. Available
command line options:

  -f, --forward           Score forward strand
  -r, --reverse           Score reverse strand
  -b, --both-strands      Score both strands
  -o, --outfile=FILENAME  Output filename
  -p, --cutoff=FLOAT      Probability cutoff
  -d, --dinucleotide      Use dinucleotide shuffled z-scores (default)
  -m, --mononucleotide    Use mononucleotide shuffled z-scores (default)
  -l, --locarnate         Use decision model for structural alignments (default=off)
  -n, --no-shuffle        Never fall back to shuffling (default=off)
  -h, --help              Print help screen
  -V, --version           Show version information


You can test RNAz on one of the example alignments (installed 
by default in /usr/local/share/RNAz/examples)

cd /usr/local/share/RNAz/examples

RNAz tRNA.aln


3.4. Output
===========

Please refer to the following commented sample output:

Header:

############################  RNAz 2.1  ##############################

Sequences: 4 ... Number of sequences in the alignment
Columns: 73  ... Number of columns of the alignment
Reading direction: forward ... Strand considered for calculation 
                                ("forward" or "reverse")
Mean pairwise identity:  80.82 ... Mean pairwise sequence identity in %
Shannon entropy: 0.31118 ... Metric for sequence diversity taking into 
                             accout also number of sequences
G+C content: 0.54795 ... GC-content
Mean single sequence MFE: -27.20 ... Average mean pairwise identity of 
Consensus MFE: -26.50 ... consensus MFE calculated by RNAalifold
Energy contribution: -23.63 ... RNAalifold energy part
Covariance contribution:  -2.87 ... RNAalifold covariance part
Combinations/Pair:   1.43 ... Number of different base pair combination
                              per predicted consensus base-pair
Mean z-score:  -1.82
Structure conservation index:   0.97
Background model: dinucleotide ... Type of background model (di- or
                                   mononucleotide) used to calculate
                                   z-score
Decision model: sequence based alignment quality  ... see --locarnate 
                                                      option
SVM decision value:   2.15 ... Internal decision value, probably 
                               not of much interest...
SVM RNA-class probability: 0.984068  <- Probability estimate for the
                                        classification, "RNAz P-value"
Prediction: RNA  ... "RNA" if P>=0.5 else "OTHER"

######################################################################

>sacCer1
GCCUUGUUGGCGCAAUCGGUAGCGCGUAUGACUCUUAAUCAUAAGGUUAGGGGUUCGAGCCCCCUACAGGGCU
(((((((.(((((........))))...((((.((((....))))))))(((((....)))))).))))))). ( -29.20, z-score =  -2.35, R)
>sacBay
GCCUUGUUGGCGCAAUCGGUAGCGCGUAUGACUCUUAAUCAUAAGGUUAGGGGUUCGAGCCCCCUACAGGGCU
(((((((.(((((........))))...((((.((((....))))))))(((((....)))))).))))))). ( -29.20, z-score =  -2.35, R)
>sacKlu
GCCUUGUUGGCGCAAUCGGUAGCGCGUAUGACUCUUAAUCAUAAGGCUAGGGGUUCGAGCCCCCUACAGGGCU
(((((((.(((((........)))).(((((.......)))))......(((((....)))))).))))))). ( -27.20, z-score =  -1.34, R)
>sacCas
GCUUCAGUAGCUCAGUCGGAAGAGCGUCAGUCUCAUAAUCUGAAGGUCGAGAGUUCGAACCUCCCCUGGAGCA
(((((((..((((........)))).((((.........))))((((((......)).))))...))))))). ( -23.20, z-score =  -1.22, R)
>consensus
GCCUUGUUGGCGCAAUCGGUAGCGCGUAUGACUCUUAAUCAUAAGGUUAGGGGUUCGAGCCCCCUACAGGGCU
(((((((..((((........)))).(((((.......))))).....(((((.......)))))))))))). (-26.50 = -23.63 +  -2.87) 


Secondary structure predictions:

The next lines summarize the secondary structure predictions in the
following format:

>sequence name
SEQUENCE
STRUCTURE (MFE, Z-SCORE, R|S)

The structure is indicated in dot bracket notation: '.' denotes a
unpaired base, while '(' and ')' denote base pairs. Energies are given
in kcal/Mol.  "R" means z-score was calculated by regression. "S"
means z-score was estimated by shuffling.

The last structure is the RNAalifold consensus structure with the consensus
MFE (broken down in energy and covariance contribution)

7. Advanced usage
=================

See the Manual (manual.pdf) for more detailed documentation.


8. Citing RNAz
==============

If you use RNAz in your work please cite:

Gruber AR, Findeiss, Washietl S, Hofacker IL, and Stadler PF. 
RNAz 2.0: Improved noncoding rna detection. 
Pac Symp Biocomput, 2010. 15:69–79.

or

Washietl S, Hofacker IL, Stadler PF
Fast and reliable prediction of noncoding RNAs.
Proc Natl Acad Sci U S A. 102(7):2454-9 (2005)


9. Contact
==========

Stefan Washietl <[email protected]>