-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
203 lines (173 loc) · 9.3 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
This suite of programs implements the approach to phylogeny
reconstruction from gene orders described in the paper
Moret, B.M.E., Wyman, S., Bader, D.A., Warnow, T., and Yan, M.,
``A detailed study of breakpoint analysis,''
Proc. 6th Pacific Symp. Biocomputing PSB 2001, Hawaii (2001).
and the revision
Moret, B.M.E., Tang, J., Wang, L.S., and Warnow, T.,
"Steps toward accurate reconstruction of phylogenies from gene-order data,"
J. Comput. Syst. Sci.
In its current state, the program allows one to explore either the
space of all possible trees on n labelled leaves or the space of all
such trees that obey (are refinements of) a particular constraint tree.
There are three executables: grappa (the main program),
invdist, and distmat
A simple "make" (using the Gnu make program) will produce all three.
A sample input file is included, with a name of 1n10k100r4.so. Try to
use "./grappa-e -m -t4 -T4 -n4 -a -f 1n10k100r4.so" to test it
The most useful combination of arguments are listed at the end of this file.
****************************************************************************
Program Limit:
**************
MAX_NUM_GENES: 200
MAX_GENOMES : 20
MAX_STR_LEN : 2048
You can modify these macros in structs.h to accomodate your requirement.
grappa:
*******
Options are as follows:
grappa -f <datafile> [-r <constraintfile>] [-t tspsolver] [-T initsolver] [-n NJsolver] [-K threshold] [-i initmethod] [-o outfile] [-s step] [-N] [-L] [-d] [-S]
[-B] [-c] [-C] [-P] [-a] [-e] [-m] [-l] [-g]
-a: a limited branch-and-bound method to prune low-level trees with
the circular ordering lower bound
-e: EDE correction for inversion distance (refer to the papers for details)
do not use it when pairwise distances are too large, or use fixed-factor (-x)
-x: fixed factor value like 0.95, should use with -e (for correction of large distances)
-m: tighter circular lower bound, recommended
-l: use the layered method (see papers); requires significant disk storage
-r: constraintfile: only generate trees that are
compatible with the parenthesized tree constraint
-C: condense triples before calling TSP solver (may make it faster) (default: OFF)
-t: median solver: [1==Greedy, 2==Exact (default), 3==Inversion median by
A. Siepel, 4==Inversion median by A. Caprara, 5==SimpleLK, 6==ChainedLK]
Note: options 5 and 6 only available if the CONCORDE package is
installed (available from Rice U.)
-T: median solver for initialization: options as for -t
-n: median solver for NJ trees: [1==Greedy, 2==Exact (default), 3==SimpleLK,
4==ChainedLK]
Note: options 3 and 4 only available if the CONCORDE package is
installed (available from Rice U.)
-K: threshold on size to use exact rather than LK solver (default 2)
-i: initmethod: [1==Random, 2==Nearest-Neighbors (small, default for -t 3
or -t 4), 3==Nearest-Neighbors (S/B, default), 4==Nearest-Neighbors
(large), 5==Propagate-Nearest, 6==Propagate-Median, 7==Uniform,
8==Adjacency-Parsimony]
-o: outfile: output filename (default to stdout)
-s: step: next tree is <step> away from current (default: 1)
-N: do not iterate labellings beyond initialization (default: off)
-L: process linear (noncircular) genomes (default: circular)
-d: use breakpoint distance (default: inversion distance)
-b: use the following integer as an upper bound in the search
-B: do not use circular ordering lower bound; implied by -S; default: use the bound)
-S: skip dist. matrix and NJ computations (for timings); also sets -B;
default: compute matrix and NJ
-g: for testing only: greedy methods, in development
-u: cutoff ratio for fixed-factor, default 3/4
-F: start layer (if you have stored the lower bounds), should use with -l
-J: avoid the computation of NJ tree
Dynamic condensing (-C) is recommended for large numbers of genes, as long
as the evolutionary rate is not too high.
The step option (-s <step>) allows one to sample tree space in an unbiased way.
Initialization routines are:
1==Random
constructs a genome at random (uniform distribution of permutations
and also of signs)
2==Nearest-Neighbors (small)
sets up a TSP at each internal node to find the median of the
three leaves closest to this node; inefficient recomputation of
identity of leaves, but fast for up to 20 genomes
However, evidences show that this is the best one working with
inversion median.
3==Nearest-Neighbors (Sankoff/Blanchette)
same as 2, except that an internal node, once labeled,
is treated as equivalent to a leaf; appears equal in quality
to 2 and 4, but runs faster
4==Nearest-Neighbors (big)
same as 2, but efficient recomputation of identity of leaves
(done once at beginning, then stored); preferred over 2 for
more than 20 genomes
5==Propagate-Nearest
identifies centroid of tree, then propagates through postorder
traversals of the three subtrees some arbitrarily chosen child
up to the root; at centroid, solves the obvious TSP instance;
very fast and very prone to local optima.
6==Propagate-Median
Same as 5: identifies centroid of tree, then propagates through
postorder traversals of the three subtrees a quick-and-dirty
approximation to the median of two; at centroid, solves the
obvious TSP instance; very fast and very prone to local optima.
7==Uniform
assigns every internal node the same ancestral genome, namely
the identity permutation; yields very poor solutions
8==Adjacency-Parsimony
sets up a TSP instance at every internal node;
each of the three subtrees of an internal node conveys to the node
a "vote" for each adjacency: desirable, indifferent, undesirable.
In propagating the votes, renormalization takes place, but scores
are just added to create the TSP instance, so it can have more than
3n nontrivial edges and edge values in the range from 0 to 6.
Very slow (quadratic time) and not very good.
Note that, when using the LK solvers to compute medians (options 5 and 6),
a threshold determines when to switch automatically to the exact solver
(option 2), because that solver is much faster than the LK solvers on
small instances.
A known problem with the LK solvers is that they require a minimum number
of cities in order to run; if you set the threshold too low, you may force
them to run with fewer than 4 cities, in which case they will crash.
Thus we recommend that you never set the threshold lower than 4.
Bounding is done by computing the circular (inversion or breakpoint) distance
on the leaves of the tree, dividing the result by two, and comparing with
the best so far; if the lower bound exceeds the upper bound, the tree
is discarded without scoring. The tighter version of this bound computed
with the "-m" option is recommended for all usage.
If a good initial upper bound has been obtained by external means,
it can be provided to the program through the -b option;
the program then only prints out trees that match or improve on
this upper bound. The "-g" option is intended to find an alternative way to
get the initial uppaer bound, we are still working on it.
By using the GMP (Gnu Multiple Precision) package, you can run a large
problem (say, with sampling using the -s option) and still be able to
count the trees, but be warned that GMP arithmetic is very slow.
Using GMP is a compile-time option, not a command-line option.
If you do not use GMP, then the tree count can rapidly overflow, giving
you useless information; when running a complete analysis, we recommend
that you turn off the tree counting.
The layered approach is effecitve, however, it is not parrallelized and
can only be used on single processor, with large requirement of hard
disk capacity.
There are two methods to compute inversion median, should always used
like the following:
grappa -t3 -T3 -f ... (for Siepel's median)
or
grappa -t4 -T4 -f ... (for Caprara's median)
The initial method for inversion median defaults to Nearest-Neighbor (small)
(option of "-i2"), not the SNN used in other cases.
Most useful combination of arguments:
*************************************
1. If the pariwise distances are not very large, the following arguments are
most used:
./grappa -t4 -T4 -n4 -e -m -a -f ...
or
./grappa -t4 -T4 -n4 -e -m -l -f
If you have large genomes, condensation is a good way to reduce the time:
./grappa -t4 -T4 -n4 -e -m -a -C -f ...
2. If you have large pairwise distance, try the fixed-factor:
./grappa -t4 -T4 -n4 -C -e -m -x 0.95 -l -C -f ...
If scoring a tree seems forever, try this:
./grappa -t2 -T2 -n2 -C -e -m -x 0.95 -l -C -f ...
This will invoke the breakpoint phylogeny, which may not be very accurate.
Layered method is aimed at difficult cases.
invdist:
********
A symbolic link to grappa, this simply takes the first two
genomes in the input file and returns their inversion distance.
The only options used are -f, to specify the file from which to get
the first two genomes, and -o, to specify the output file (if desired);
all other command-line options are discarded.
distmat:
********
A symbolic link to grappa, this simply prints the inversion and
breakpoint distance matrices.
The only options used are -f, to specify the file from which to get
the genomes, and -o, to specify the output file (if desired);
all other command-line options are discarded.