Commit ba6991f

James Barbetti authored and committed
Added more information about algorithms.
1 parent 01f73c2 commit ba6991f

1 file changed, +80 -1 lines changed

doco/Algorithms.md

Lines changed: 80 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -24,7 +24,7 @@ appear to be slightly slower) than the NJ-R and NJ-R-V implementations.
2424
| ---- | -------- | ----------- | ----- |
2525
| UPGMA | D | UPGMA (assumes tree is ultrametric) | Not recommended |
2626
| UPGMA-V | D | Vectorized version of UPGMA. | Not recommended |
27-
| NJ. | D | Neighbor Joining | |
27+
| NJ | D | Neighbor Joining | |
2828
| NJ-V | D | Vectorized version of Neighbor Joining | |
2929
| NJ-R | D, S, I | NJ with branch and bound optimization | Recommended |
3030
| NJ-R-D | D, S, I. | Double precision version of NJ-R. | |
@@ -36,3 +36,82 @@ appear to be slightly slower) than the NJ-R and NJ-R-V implementations.
3636
| BIONJ-R | D, V, S, I | BIONJ with branch-and-bound optimization | Recommended. But slower than NJ-R |
3737
| AUCTION | D, S, I | Reverse auction cluster joining | Experimental. Not recommended |
3838

39+
<h2>Other common features</h2>
40+
All of the distance-based algorithms implemented in decentTree make use of distance matrices.
41+
Distance-*matrix* algorithms take, as their principal input (apart from a list of names of the N taxa),
42+
an N-row, N-column matrix of distances; the distance between taxa *a* and *b* can be read by
43+
looking at the *b*th entry in the *a*th row (or the *a*th entry in the *b*th row,
44+
if distances are symmetric). In practice, distances *are* symmetric and
45+
the distance from *a* to *b* is the same as the distance from *b* to *a*.
46+
The distance between any sequence and itself is assumed to be zero.
47+
48+
Uncorrected distances are typically calculated by counting the number of characters that differ
49+
between two sequences in a sequence alignment, and dividing the count by the total number
50+
of sites compared (those present in both sequences). Corrected distances are calculated by adjusting the uncorrected distance using a Jukes-Cantor (or similar) correction.
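
As a rough illustration of the two kinds of distance (a minimal sketch, not decentTree's own code), the following computes an uncorrected distance for a pair of aligned sequences and applies the standard Jukes-Cantor correction. The function names, the `gap_chars` parameter, and the choice to skip sites that are missing in either sequence are assumptions made for this example.

```python
import math

def uncorrected_distance(seq_a, seq_b, gap_chars="-?N"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in gap_chars or y in gap_chars:
            continue  # assumption: skip sites missing in either sequence
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0

def jukes_cantor(p, states=4):
    """Jukes-Cantor correction of an uncorrected distance p (DNA: states=4)."""
    b = (states - 1) / states
    if p >= b:
        return math.inf  # the correction is undefined at saturation
    return -b * math.log(1.0 - p / b)
```

For example, `uncorrected_distance("ACGTACGT", "ACGTACGA")` gives 0.125 (one difference in eight sites), and `jukes_cantor(0.125)` is a little larger, at roughly 0.1367.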
51+
52+
Distance matrix algorithms can use either uncorrected or corrected distances between taxa. In practice, using uncorrected distances seems to give better results (indeed, the formulae that are used to determine the
53+
calculated or estimated distances between imputed ancestors were designed assuming that all distances are
54+
uncorrected, and using them on corrected distances does not really make mathematical sense).
55+
56+
decentTree can be supplied with a sequence alignment (rather than a distance matrix). It will
57+
calculate corrected (or uncorrected) distances between the taxa in the alignment.
58+
59+
Neighbour-joining algorithms (except for AUCTION and STICHUP algorithms, which use raw distances) tend to
60+
look for neighbours by searching for pairs of clusters (or individual taxa) with a minimal adjusted
61+
difference. (The literature tends to talk about a Q matrix, where Qij is the adjusted difference between
62+
the *i*th and *j*th cluster). The details vary from algorithm to algorithm, but typically the adjusted
63+
distances are calculated by subtracting "compensatory" terms to adjust for how distant the clusters are,
64+
on average, from all other clusters. In the literature, entries in the Q matrix are calculated as
65+
66+
(N-2)*D(x,y) - sum(D row for x) - sum(D row for y)
67+
68+
(or something like it). In practice, decentTree tends to divide by N-2, so:
69+
70+
D(x,y) - (1/(N-2)) * sum(D row for x) - (1/(N-2)) * sum(D row for y)
71+
72+
(or something like it). This is because multiplying by (N-2) frequently (in the case of the simpler algorithms, once for every off-diagonal entry in an N*N matrix!) is a lot more expensive than multiplying N row totals by (1/(N-2)) once.
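
The divided form of the adjusted distance can be sketched as follows. This is an illustrative sketch under stated assumptions, not decentTree's implementation: `best_pair` is a hypothetical name, `D` is assumed to be a full, symmetric, square list of lists, and ties are resolved by keeping the first minimum found (the source does not say how ties are broken).

```python
def best_pair(D):
    """Return (i, j, adjusted) for the pair with the smallest adjusted distance."""
    n = len(D)
    if n < 3:
        raise ValueError("need at least three clusters")
    scale = 1.0 / (n - 2)
    # One multiplication per row total, rather than one per matrix entry.
    scaled_totals = [sum(row) * scale for row in D]
    best = None
    for i in range(n):
        row = D[i]
        for j in range(i + 1, n):
            adjusted = row[j] - scaled_totals[i] - scaled_totals[j]
            if best is None or adjusted < best[2]:
                best = (i, j, adjusted)  # assumption: first minimum wins ties
    return best
```

Because every entry is divided by the same positive constant (N-2), the pair that minimises this quantity is the same pair that minimises the Q matrix entry from the literature.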
73+
74+
[Todo: talk about how, in decentTree, the arrays are real, and also square]
75+
76+
Each time the algorithm identifies the pair (a,b) of clusters to be merged next,
77+
the two clusters, a and b, are removed and replaced with the cluster u that is their union.
78+
Assume that a is the cluster with the row that appears first in the matrix.
79+
- Row a is overwritten with row u (likewise column a)
80+
- Row b is overwritten with the contents of the last row (likewise, column b)
81+
- Whatever cluster was mapped to the last row (and column) is "remapped" to row (and column) b
82+
- The order of the matrix (the number of rows and columns in use) is reduced by one.
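
The four steps above can be sketched as a single in-place update. This is a minimal sketch, assuming a full square list-of-lists matrix `D`, a list `row_to_cluster` that maps each row to a cluster identifier, and the textbook Neighbour Joining rule for distances to the new cluster; the function name and parameters are hypothetical, and other algorithms (UPGMA, BIONJ) would use a different rule for row u.

```python
def merge_clusters(D, row_to_cluster, n, a, b, new_id):
    """Merge the clusters in rows a < b of the in-use n-by-n part of D.

    Returns the new number of rows (and columns) in use, n - 1.
    """
    assert 0 <= a < b < n
    d_ab = D[a][b]
    # Row (and column) a is overwritten with the distances for the union u.
    for k in range(n):
        if k != a and k != b:
            d_uk = 0.5 * (D[a][k] + D[b][k] - d_ab)  # assumed NJ update rule
            D[a][k] = d_uk
            D[k][a] = d_uk
    D[a][a] = 0.0
    row_to_cluster[a] = new_id
    # Row (and column) b is overwritten with the last in-use row (and column),
    # and the cluster that lived in the last row is remapped to row b.
    last = n - 1
    if b != last:
        for k in range(n):
            D[b][k] = D[last][k]
            D[k][b] = D[k][last]
        D[b][b] = 0.0
        row_to_cluster[b] = row_to_cluster[last]
    # Nothing is deallocated; the caller simply treats the matrix as smaller.
    return n - 1
```

After the call, only the leading (n-1) by (n-1) block of `D` and the first n-1 entries of `row_to_cluster` are meaningful.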
83+
84+
Another approach that is often used in distance matrix algorithms is virtual
85+
deletion: "marking" the merged clusters (a and b) as "no longer in use".
86+
But with this approach, the memory for the (working distance matrix) D entries that
87+
refer to the "retired" clusters remains in use. The issue isn't that it isn't
88+
deallocated, but that it needs to be read. Moving the entries in (what was) the
89+
last column into the vacancies left by the removal of cluster b, and writing
90+
the entries for cluster u into (what was) the column for cluster a, reduces the
91+
amount of memory in use (though, not the amount of memory allocated!).
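
To make the contrast concrete, here is a small sketch comparing how many matrix entries a pair search visits under each approach. Both functions are hypothetical illustrations (they search raw distances, and the names `scan_virtual`, `scan_compacted`, and `retired` are invented for this example), not code from decentTree.

```python
def scan_virtual(D, retired):
    """Find the closest live pair when retired rows are only flagged."""
    n = len(D)  # the scan range never shrinks
    best, touched = None, 0
    for i in range(n):
        for j in range(i + 1, n):
            touched += 1
            if i in retired or j in retired:
                continue  # the entry is skipped, but it was still visited
            if best is None or D[i][j] < best[2]:
                best = (i, j, D[i][j])
    return best, touched

def scan_compacted(D, n):
    """Find the closest pair in the leading n-by-n block of a compacted matrix."""
    best, touched = None, 0
    for i in range(n):
        for j in range(i + 1, n):
            touched += 1
            if best is None or D[i][j] < best[2]:
                best = (i, j, D[i][j])
    return best, touched
```

With five allocated rows of which two are retired, the flagged search still walks all ten upper-triangle positions, whereas the compacted search walks only the three that are in use.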
92+
93+
Since the sum of the squares is
94+
95+
Maintaining the entire matrix (and not just the upper or lower triangle) makes it
96+
possible to do the memory accesses almost entirely sequentially (except for the
97+
column rewriting and moving when clusters are moved).
98+
99+
<h2>Working matrix reallocation</h2>
100+
101+
102+
<h2>Treatment of duplicates</h2>
103+
All of the decentTree distance-matrix algorithms have special treatment for sequences that
104+
can be treated as identical (or rather, for taxa whose rows in the initial distance matrix are identical).
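
The text does not spell out what the special treatment is, so the sketch below covers only the detection step: grouping taxa whose rows in the initial distance matrix are identical. It is an assumption-laden illustration (the function name and the exact-equality comparison are choices made here), not a description of decentTree's code.

```python
from collections import defaultdict

def group_identical_rows(names, D):
    """Return lists of taxon names whose distance-matrix rows are identical."""
    groups = defaultdict(list)
    for name, row in zip(names, D):
        # Two duplicate taxa are at distance zero from each other and at equal
        # distance from every other taxon, so their rows match entry for entry.
        groups[tuple(row)].append(name)
    return [members for members in groups.values() if len(members) > 1]
```

For instance, with `names = ["a", "b", "c"]` and rows `[0, 0, 3]`, `[0, 0, 3]`, `[3, 3, 0]`, the function returns `[["a", "b"]]`.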
105+
106+
<h2>Tie-breaking</h2>
107+
108+
<h2>Treatment of Rounding Error</h2>
109+
Rounding error is ignored.
110+
111+
<h2>Comments</h2>
112+
Removing columns from the *right* (and rows from the *bottom*) was probably a mistake.
113+
Columns should have been removed from the left (and matrix row pointers incremented).
114+
There would have been a cache utilisation advantage, since the entries being moved
115+
would have been near at least some of the entries to be read at the start of the
116+
next search for a pair of clusters to merge. Similarly, rows should have been removed
117+
from the *top* of the matrix.
