Commit e8110f6

James Barbetti authored and committed
Added more details about how (just about all) of the Algorithms are implemented.
I still need to add a bit more explanation about the ONJ algorithm (as it doesn't actually have a D matrix, but it still carries out the described operations on the I and S matrices).
1 parent 271c128 commit e8110f6

1 file changed: +30 -8 lines changed

doco/Algorithms.md

Lines changed: 30 additions & 8 deletions
@@ -36,18 +36,17 @@ appear to be slightly slower) than the NJ-R and NJ-R-V implementations.
3636
| BIONJ-R | D, V, S, I | BIONJ with branch-and-bound optimization | Recommended. But slower than NJ-R |
3737
| AUCTION | D, S, I | Reverse auction cluster joining | Experimental. Not recommended |
3838

39-
<h2>Other commonon features</h2>
39+
<h2>Other common features</h2>
4040
All of the distance-based algorithms implemented in decentTree make use of distance matrices.
4141
Distance-*matrix* algorithms take, as their principal input (apart from a list of names of the N Taxa),
4242
an N row, N column matrix of distances; the distance between taxa *a* and *b* can be read by
4343
looking at the *b*th entry in the *a*th row (or the *a*th entry in the *b*th row,
4444
if distances are symmetric). In practice, distances *are* symmetric and
4545
the distance from a to b is the same as the distance from b to a.
46-
The distance between any sequence and itelf is assumed to be zero.
47-
46+
The distance between any sequence and itself is assumed to be zero.<br><br>
4847
Uncorrected distances are typically calculated by counting the number of characters that differ
4948
between two sequences in a sequence alignment, and dividing the count by the total number
50-
of sites in both sequences. Corrected distances are calculated by adjusting the uncorrected distance using a Jukes-Cantor (or similar) correction.
49+
of sites in both sequences. Corrected distances are calculated by adjusting the uncorrected distance using a Jukes-Cantor (or similar) correction.<br><br>
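As a rough illustration of the two kinds of distance described above, here is a minimal sketch. The function names are hypothetical (they are not decentTree identifiers), the uncorrected distance is taken as the proportion of aligned sites that differ, and gaps and ambiguity codes are treated as ordinary characters, which a real implementation would probably not do. The correction shown is the standard Jukes-Cantor formula for nucleotide data.

```python
import math


def uncorrected_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differing = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differing / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor correction of an uncorrected nucleotide distance p."""
    if p >= 0.75:
        # The correction is undefined at (or beyond) saturation.
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


p = uncorrected_distance("ACGTACGT", "ACGTACGA")  # one site in eight differs
d = jukes_cantor_distance(p)                      # slightly larger than p
```

The corrected distance is always at least as large as the uncorrected one, and the gap between them widens as the sequences diverge.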
5150

5251
Distance matrix algorithms can use either uncorrected or corrected distances between taxa. In practice, using uncorrected distances seems to give better results (indeed, the formulae that are used to determine the
5352
calculated or estimated distances between imputed ancestors, were designed assuming that all distances are
@@ -60,8 +59,7 @@ Neighbour-joining algorithms (except for AUCTION and STICHUP algorithms, which u
6059
look for neighbours by searching for pairs of clusters (or individual taxa) with a minimal adjusted
6160
difference (the literature tends to talk about a Q matrix, where Qij is the adjusted difference between
6261
the *i*th and *j*th cluster). The details vary from algorithm to algorithm, but typically the adjusted
63-
distances are calculated by subtracting "compensatory" terms to adjust for how distant the clsuters are,
64-
on average, from all other clusters. In the literature entries in the Q matrix are calculated as
62+
distances are calculated by subtracting "compensatory" terms to adjust for how distant each of the two clusters is, on average, from all other clusters. In the literature, entries in the Q matrix are calculated as
6563

6664
(N-2)*D(x,y) - sum(D row for x) - sum(D row for Y)
6765

@@ -71,7 +69,17 @@ D(x,y) - (1/(n-2)) * sum(D row for x) - (1/(n-2)) *sum(D row for Y)
7169

7270
(or something like it). This is because multiplying by (N-2) frequently (in the case of the simpler algorithms, once for every non-diagonal entry in an N*N matrix!) is a lot more expensive than multiplying N row totals by (1/(N-2)), once.
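The equivalence of the two formulations can be checked with a small sketch. The function names and the five-taxon matrix are illustrative assumptions, not decentTree code: the first function evaluates the textbook Q entries, the second scales each row total by 1/(N-2) once and then uses only subtractions in the inner loop. Because the second quantity is the first divided by the positive constant (N-2), both pick the same pair.

```python
def best_pair_textbook(d):
    """Pair minimising (N-2)*D(x,y) - rowsum(x) - rowsum(y)."""
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for x in range(n):
        for y in range(x + 1, n):
            q = (n - 2) * d[x][y] - totals[x] - totals[y]
            if best is None or q < best[0]:
                best = (q, x, y)
    return best[1], best[2]


def best_pair_scaled(d):
    """Same search, with row totals scaled by 1/(N-2) once, up front."""
    n = len(d)
    scaled = [sum(row) / (n - 2) for row in d]  # N multiplications in total
    best = None
    for x in range(n):
        for y in range(x + 1, n):
            adjusted = d[x][y] - scaled[x] - scaled[y]  # no multiplication here
            if best is None or adjusted < best[0]:
                best = (adjusted, x, y)
    return best[1], best[2]


distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair = best_pair_scaled(distances)
```

Both functions assume N is greater than 2 and a symmetric matrix with a zero diagonal.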
7371

74-
[Todo: talk about how, in decentTree, the arrays are real, and also square]
72+
The initial N*N distance matrix is laid out in memory, in row major order:
73+
with all the distances in the first row, then all the distances in the second row, and so on. Whenever two clusters, one in row
74+
(and column) A, one in row (and column) B, A less than B, are joined,
75+
they are replaced by a new cluster (the one that joins them), by:
76+
<ul>
77+
<li>writing distances to the new cluster in row (and column) A</li>
78+
<li>overwriting row (and column) B with the distances from the last row (and column)</li>
79+
<li>reducing the number of rows (and columns) by one</li>
80+
</ul>
81+
82+
Pointers to the start of each of the rows are maintained (the starts of rows stay in the same place, what is changing is: which row is mapped to which cluster).
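The three steps listed above can be sketched as follows. This is a simplified model, not the decentTree implementation: the class and attribute names are hypothetical, row "pointers" are modelled as offsets into a flat list, and the distances to the new cluster use the standard neighbour-joining update, 0.5 * (D(a,k) + D(b,k) - D(a,b)), as a stand-in for whatever a particular algorithm computes.

```python
class SquareMatrix:
    """Row-major N*N distance matrix that shrinks in place as clusters join."""

    def __init__(self, distances):
        n = len(distances)
        self.size = n                                   # rows/columns in use
        self.data = [x for row in distances for x in row]
        self.row_start = [r * n for r in range(n)]      # never changes
        self.cluster_of_row = list(range(n))            # this mapping changes

    def get(self, r, c):
        return self.data[self.row_start[r] + c]

    def set(self, r, c, value):
        self.data[self.row_start[r] + c] = value

    def join(self, a, b, new_cluster):
        if not 0 <= a < b < self.size:
            raise ValueError("need 0 <= a < b < size")
        last = self.size - 1
        d_ab = self.get(a, b)
        # Assumed update rule (standard NJ); other algorithms differ here.
        new_dist = [0.5 * (self.get(a, k) + self.get(b, k) - d_ab)
                    for k in range(self.size)]
        # 1. Write distances to the new cluster into row and column a.
        for k in range(self.size):
            if k != a:
                self.set(a, k, new_dist[k])
                self.set(k, a, new_dist[k])
        # 2. Overwrite row and column b with the last row and column.
        if b != last:
            for k in range(last):
                value = self.get(last, k)
                self.set(b, k, value)
                self.set(k, b, value)
            self.set(b, b, 0.0)
        self.cluster_of_row[b] = self.cluster_of_row[last]
        self.cluster_of_row[a] = new_cluster
        self.cluster_of_row.pop()
        # 3. One fewer row and column is now in use.
        self.size -= 1


m = SquareMatrix([
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
])
m.join(0, 1, new_cluster=5)
```

After the call, row 0 holds the new cluster, row 1 holds what used to be the last cluster, and the row offsets are exactly what they were before the join.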
7583

7684
Each time the algorithm identifies the pair (a,b) of clusters to be merged next,
7785
the two clusters, a, and b, are removed, and replaced with the cluster, u, that is their union.
@@ -90,13 +98,27 @@ last column into the vacancies left by the removal of cluster b, and writing
9098
the entries for cluster u into (what was) the column for cluster a, reduces the
9199
amount of memory in use (though, not the amount of memory allocated!).
92100

93-
Since the sum of the squares is
101+
Since the sum of the first N squares is N(N+1)(2N+1)/6 (approximately one third of the cube of N), the reduction in the expected number of cache fetches (or cache misses) resulting from access to the distance matrix, over the course of an execution is, for large enough N, about two thirds. <i>This holds only if all of the distances in the distance matrix are actually examined, once every iteration, as they are in the NJ and BIONJ algorithms, but <b>not</b> in the NJ-R, BIONJ-R, and the other "-R" algorithms.</i>
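The "about two thirds" figure can be checked numerically. The sketch below is a simplified model under stated assumptions: every in-use entry is touched once per iteration, the shrinking matrix has k*k entries in use when k clusters remain, and the comparison is against touching all N*N allocated entries on each of N iterations. The function names are illustrative only.

```python
def entries_touched_shrinking(n: int) -> int:
    """Entries examined if only the k*k in-use entries are read each iteration."""
    return sum(k * k for k in range(1, n + 1))


def sum_of_squares_closed_form(n: int) -> int:
    """N(N+1)(2N+1)/6, the closed form quoted in the text."""
    return n * (n + 1) * (2 * n + 1) // 6


def entries_touched_fixed(n: int) -> int:
    """Entries examined if all N*N allocated entries are read on each of N iterations."""
    return n * n * n


n = 1000
ratio = entries_touched_shrinking(n) / entries_touched_fixed(n)
saving = 1.0 - ratio  # approaches two thirds as n grows
```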
94102

95103
Maintaining the entire matrix (and not just the upper or lower triangle) makes it
96104
possible to do the memory accesses almost entirely sequentially (except for the
97105
column rewriting and moving when clusters are moved).
98106

107+
In algorithms that have Variance Estimate matrices, operations (row and column overwrites, row and column deletes) are "mirrored" on the Variance Estimate matrices.
108+
109+
Row (but not column!) operations are mirrored on the "sorted distance" (S) and "index" (I) matrices.
110+
111+
Columns cannot easily be deleted out of existing rows of the S and I matrices (if the algorithm has them), because in each row of those arrays, the entries are sorted by ascending distance (so to find out which entry
112+
is for a column that is to be removed, a search would be necessary, and to
113+
write an entry for the column for a newly joined cluster, an insert into
114+
a sorted array would be necessary). The I and S matrices contain entries
115+
for clusters which have a cluster number less than that of the cluster
116+
mapped to the row they are in. As neighbour joining continues, some of these will be for clusters that are no longer under consideration, because
117+
they have already been joined into another, newer, cluster. Distances to
118+
these clusters are "skipped" over.
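A minimal sketch of one row of the S and I matrices, and of the "skipping" described above, might look like this. The function names, and the use of a Python set to record which clusters are still under consideration, are assumptions made for illustration; the source does not say how decentTree tracks that.

```python
def build_sorted_row(distance_row, row_cluster, cluster_of_column):
    """Build one (S row, I row) pair.

    Only clusters numbered lower than row_cluster get an entry. The S row
    holds their distances in ascending order; the I row holds the matching
    cluster numbers in the same order.
    """
    pairs = sorted(
        (dist, cluster)
        for dist, cluster in zip(distance_row, cluster_of_column)
        if cluster < row_cluster
    )
    s_row = [dist for dist, _ in pairs]
    i_row = [cluster for _, cluster in pairs]
    return s_row, i_row


def nearest_active(s_row, i_row, active):
    """First entry, in ascending distance order, whose cluster is still active."""
    for dist, cluster in zip(s_row, i_row):
        if cluster in active:
            return dist, cluster
        # Otherwise the cluster was already joined into a newer one: skip it.
    return None


s_row, i_row = build_sorted_row([8, 9, 7, 3, 0], 4, [0, 1, 2, 3, 4])
# Suppose cluster 3 has since been joined into a newer cluster.
hit = nearest_active(s_row, i_row, active={0, 1, 2, 4})
```

Leaving stale entries in place and skipping them on read avoids both the search and the sorted insert that deleting or adding a column would require.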
119+
99120
<h2>Working matrix reallocation</h2>
121+
During the course of the execution of a distance matrix algorithm, as the number of rows and columns in use falls, less and less of the memory allocated to the matrix remains in use. Periodically, the items still in use in the matrix are moved, so that a smaller block of sequential memory contains all the distances in the matrix (in row major order), with no "unused" memory between rows.
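The compaction step can be sketched as below. This is a simplified model rather than the decentTree code: the matrix is a flat Python list, `old_stride` is the row length it was allocated with, `size` is the number of rows and columns still in use, and truncating the list stands in for releasing memory. The policy for when to compact is not described in the source and is not modelled here.

```python
def compact(data, old_stride, size):
    """Pack the size*size in-use entries to the front of the buffer, in place.

    Rows are copied in ascending order. Each destination offset is never
    larger than its source offset, so no entry is overwritten before it
    has been moved.
    """
    if size > old_stride:
        raise ValueError("size cannot exceed the allocated row length")
    for r in range(size):
        src = r * old_stride
        dst = r * size
        data[dst:dst + size] = data[src:src + size]
    del data[size * size:]  # stand-in for giving memory back
    return size             # the new row stride


# A matrix allocated as 4x4, with only the top-left 2x2 still in use.
buffer = [r * 10 + c for r in range(4) for c in range(4)]
new_stride = compact(buffer, old_stride=4, size=2)
```

After compaction, row r starts at offset r * size, so any stored row offsets would need to be recomputed.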
100122

101123

102124
<h2>Treatment of duplicates</h2>
