Added more details about how (just about all) of the Algorithms are implemented.
I still need to add a bit more explanation about the ONJ algorithm
(as it doesn't actually have a D matrix, but it still carries out the
described operations on the I and S matrices).
doco/Algorithms.md
30 additions & 8 deletions
@@ -36,18 +36,17 @@ appear to be slightly slower) than the NJ-R and NJ-R-V implementations.
36
36
| BIONJ-R | D, V, S, I | BIONJ with branch-and-bound optimization | Recommended. But slower than NJ-R |
37
37
| AUCTION | D, S, I | Reverse auction cluster joining | Experimental. Not recommended |
38
38
39
-
<h2>Other commonon features</h2>
39
+
<h2>Other common features</h2>
40
40
All of the distance-based algorithms implemented in decentTree make use of distance matrices.
41
41
Distance-*matrix* algorithms take, as their principal input (apart from a list of names of the N Taxa),
42
42
an N row, N column matrix of distances; the distance between taxa *a* and *b* can be read by
43
43
looking at the *b*th entry in the *a*th row (or the *a*th entry in the *b*th row,
44
44
if distances are symmetric). In practice, distances *are* symmetric and
45
45
the distance from a to b is the same as the distance from b to a.
46
-
The distance between any sequence and itelf is assumed to be zero.
47
-
46
+
The distance between any sequence and itself is assumed to be zero.<br><br>
48
47
Uncorrected distances are typically calculated by counting the number of characters that must differ,
49
48
between two sequences in a sequence alignment, and dividing the count by the total number
50
-
of sites in both sequences. Corrected distances are calculated by adjusting the uncorrected distance using a Jukes-Cantor (or similar) correction.
49
+
of sites in both sequences. Corrected distances are calculated by adjusting the uncorrected distance using a Jukes-Cantor (or similar) correction.<br><br>
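As a rough illustration of the two kinds of distance described above, here is a minimal Python sketch. It is not decentTree's code: the function names `p_distance` and `jukes_cantor` are hypothetical, the sequences are assumed to be equal-length rows of one alignment, gaps and ambiguity codes get no special treatment, and the count is divided by the alignment length.

```python
import math


def p_distance(seq_a, seq_b):
    """Uncorrected distance: fraction of aligned sites at which two sequences differ.

    Assumption: both sequences come from the same alignment (equal length),
    and every character, including gaps, is compared literally.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differing = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differing / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor correction for nucleotide data.

    The correction is undefined once p reaches 0.75, so infinity is
    returned there (one possible convention, chosen for this sketch).
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

The corrected distance is never smaller than the uncorrected one, and the gap between them widens as the sequences diverge.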
51
50
52
51
Distance matrix algorithms can use either uncorrected or corrected distances between taxa. In practice, using uncorrected distances seems to give better results (indeed, the formulae that are used to determine the
53
52
calculated or estimated distances between imputed ancestors, were designed assuming that all distances are
@@ -60,8 +59,7 @@ Neighbour-joining algorithms (except for AUCTION and STICHUP algorithms, which u
60
59
look for neighbours by searching for pairs of clusters (or individual taxa) with a minimal adjusted
61
60
difference (the literature tends to talk about a Q matrix, where Qij is the adjusted difference between
62
61
the *i*th and *j*th cluster). The details vary from algorithm to algorithm, but typically the adjusted
63
-
distances are calculated by subtracting "compensatory" terms to adjust for how distant the clsuters are,
64
-
on average, from all other clusters. In the literature entries in the Q matrix are calculated as
62
+
distances are calculated by subtracting "compensatory" terms to adjust for how distant each of the two clusters is, on average, from all other clusters. In the literature, entries in the Q matrix are calculated as
65
63
66
64
(N-2)*D(x,y) - sum(D row for x) - sum(D row for Y)
67
65
@@ -71,7 +69,17 @@ D(x,y) - (1/(n-2)) * sum(D row for x) - (1/(n-2)) *sum(D row for Y)
71
69
72
70
(or something like it). This is because multiplying by (N-2), frequently (in the case of the simpler algorithms, once for every non-diagonal entry in an (N*N) matrix!), is a lot more expensive than multiplying N row totals by (1/(N-2)), once.
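The saving described above can be sketched in a few lines of Python. This is an illustrative sketch only, not decentTree's implementation: `best_pair` is a hypothetical name, the matrix is a plain list of lists, and ties are broken by taking the first minimum found.

```python
def best_pair(d):
    """Return (i, j), with i < j, minimising d[i][j] - r[i] - r[j],
    where r[k] is the row total for cluster k multiplied by 1/(n-2).

    The row totals are scaled once (n multiplications) instead of
    multiplying every off-diagonal distance by (n-2).
    """
    n = len(d)
    if n < 3:
        raise ValueError("need at least three clusters")
    scale = 1.0 / (n - 2)
    scaled_totals = [sum(row) * scale for row in d]
    best = None
    best_value = float("inf")
    for i in range(n):
        row = d[i]
        for j in range(i + 1, n):
            value = row[j] - scaled_totals[i] - scaled_totals[j]
            if value < best_value:
                best_value = value
                best = (i, j)
    return best
```

Dividing the literature's Q entries by the positive constant (n-2) does not change which pair is minimal, so both forms select the same pair.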
73
71
74
-
[Todo: talk about how, in decentTree, the arrays are real, and also square]
72
+
The initial N*N distance matrix is laid out in memory, in row major order:
73
+
with all the distances in the first row, then all the distances in the second row, and so on. Whenever two clusters, one in row
74
+
(and column) A, one in row (and column) B, A less than B, are joined,
75
+
they are replaced by a new cluster (the one that joins them), by
76
+
<ul>
77
+
<li>writing distances to the new cluster in row (and column) A</li>
78
+
<li>overwriting row (and column) B with the distances from the last row (and column)</li>
79
+
<li>reducing the number of rows (and columns) by one</li>
80
+
</ul>
81
+
82
+
Pointers to the start of each of the rows are maintained (the starts of rows stay in the same place, what is changing is: which row is mapped to which cluster).
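The three steps listed above might look roughly like the following Python sketch. Several things here are assumptions made for illustration and are not taken from decentTree: the name `join_clusters`, the use of a flat list with index arithmetic (`row * stride`) standing in for row pointers, the `row_cluster` list that records which cluster each row is mapped to, and the plain NJ formula for the distance from the new cluster to every other cluster.

```python
def join_clusters(buf, stride, n, row_cluster, a, b, new_id):
    """Join the clusters in rows a < b of an n*n matrix held row-major in buf.

    stride is the allocated row length (it does not change here).
    Returns the new number of rows (and columns) in use.
    """
    assert 0 <= a < b < n
    d_ab = buf[a * stride + b]
    # Step 1: write distances to the new cluster into row (and column) a.
    # Assumption: plain NJ update, d(u,k) = (d(a,k) + d(b,k) - d(a,b)) / 2.
    for k in range(n):
        if k != a and k != b:
            d_uk = 0.5 * (buf[a * stride + k] + buf[b * stride + k] - d_ab)
            buf[a * stride + k] = d_uk
            buf[k * stride + a] = d_uk
    buf[a * stride + a] = 0.0
    row_cluster[a] = new_id
    # Step 2: move the last row (and column) into the slot vacated by b.
    last = n - 1
    if b != last:
        for k in range(n):
            if k != b and k != last:
                v = buf[last * stride + k]
                buf[b * stride + k] = v
                buf[k * stride + b] = v
        buf[b * stride + b] = 0.0
        row_cluster[b] = row_cluster[last]
    # Step 3: one fewer row (and column) is in use; row starts do not move.
    row_cluster.pop()
    return n - 1
```

Note that the buffer keeps its original size and stride: only the count of rows in use, and the row-to-cluster mapping, change.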
75
83
76
84
Each time the algorithm identifies the pair (a,b) of clusters to be merged next,
77
85
the two clusters, a, and b, are removed, and replaced with the cluster, u, that is their union.
@@ -90,13 +98,27 @@ last column into the vacancies left by the removal of cluster b, and writing
90
98
the entries for cluster u into (what was) the column for cluster a, reduces the
91
99
amount of memory in use (though, not the amount of memory allocated!).
92
100
93
-
Since the sum of the squares is
101
+
Since the sum of the first N squares is N(N+1)(2N+1)/6 (approximately one third of the cube of N), the reduction in the expected number of cache fetches (or cache misses) resulting from access to the distance matrix, over the course of an execution is, for large enough N, about two thirds. <i>This holds only if all of the distances in the distance matrix are actually examined, once every iteration, as they are in the NJ and BIONJ algorithms, but <b>not</b> in the NJ-R, BIONJ-R, and the other "-R" algorithms.</i>
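The two-thirds figure can be checked numerically. The sketch below is only a back-of-the-envelope count, under assumptions that are not from the source: the function name `entries_examined` is hypothetical, every iteration is assumed to scan the whole matrix, the run is assumed to go from n clusters down to 3, and the baseline is assumed to be a scan over the full original n*n footprint on every iteration.

```python
def entries_examined(n, shrinking):
    """Count matrix entries scanned over a run from n clusters down to 3.

    shrinking=True:  iteration with k clusters scans k*k entries.
    shrinking=False: every iteration scans the original n*n footprint
                     (the assumed baseline for this comparison).
    """
    if n < 3:
        raise ValueError("need at least three clusters")
    if shrinking:
        return sum(k * k for k in range(3, n + 1))
    return (n - 2) * n * n


def reduction(n):
    """Fraction of scanned entries saved by keeping the live matrix compact."""
    return 1.0 - entries_examined(n, True) / entries_examined(n, False)
```

Under these assumptions, `reduction(n)` approaches two thirds as n grows.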
94
102
95
103
Maintaining the entire matrix (and not just the upper or lower triangle) makes it
96
104
possible to do the memory accesses almost entirely sequentially (except for the
97
105
column rewriting and moving when clusters are moved).
98
106
107
+
In algorithms that have Variance Estimate matrices, operations (row and column overwrites, row and column deletes) are "mirrored" on the Variance Estimate matrices.
108
+
109
+
Row (but not column!) operations are mirrored on the "sorted distance" (S) and "index" (I) matrices.
110
+
111
+
Columns cannot easily be deleted out of existing rows of the S and I matrices (if the algorithm has them), because in each row of those arrays, the entries are sorted by ascending distance (so to find out which entry
112
+
is for a column that is to be removed, a search would be necessary, and to
113
+
write an entry for the column for a newly joined cluster, an insert into
114
+
a sorted array would be necessary). The I and S matrices contain entries
115
+
for clusters which have a cluster number less than that of the cluster
116
+
mapped to the row they are in. As neighbour joining continues, some of these will be for clusters that are no longer under consideration, because
117
+
they have already been joined into another, newer, cluster. Distances to
118
+
these clusters are "skipped" over.
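A small Python sketch of the idea described above: one row of S and I is built by sorting, and stale entries are skipped while scanning rather than deleted. This is a simplification for illustration, not decentTree's code. The names `build_sorted_row` and `nearest_live` are hypothetical, row numbers are assumed to equal cluster numbers, and raw distances are scanned (the "-R" algorithms compare adjusted distances and use bounds, which is omitted here).

```python
def build_sorted_row(d, r):
    """Build one row of S and I for cluster r.

    Only clusters numbered lower than r get an entry.
    Returns (s_row, i_row): distances in ascending order, and the
    cluster number that each distance refers to.
    """
    pairs = sorted((d[r][c], c) for c in range(r))
    s_row = [dist for dist, _ in pairs]
    i_row = [c for _, c in pairs]
    return s_row, i_row


def nearest_live(s_row, i_row, live):
    """Scan a sorted row, skipping clusters that have already been joined.

    live is the set of cluster numbers still under consideration.
    Returns (distance, cluster), or None if every entry is stale.
    """
    for dist, c in zip(s_row, i_row):
        if c in live:
            return dist, c
    return None
```

Skipping costs a membership check per stale entry, but avoids both the search and the sorted insert that deleting or adding a column would need.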
119
+
99
120
<h2>Working matrix reallocation</h2>
121
+
During the course of the execution of a distance matrix algorithm, as the number of rows and columns in use falls, less and less of the memory allocated to the matrix remains in use. Periodically, the items still in use in the matrix are moved, so that a smaller block of sequential memory contains all the distances in the matrix (in row major order), with no "unused" memory between rows.
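The compaction step described above could be sketched as follows. This is an assumption-laden illustration, not decentTree's code: `compact` is a hypothetical name, the matrix is a flat Python list, and the live n*n block is assumed to sit in the top-left corner of a buffer whose rows are `old_stride` entries long.

```python
def compact(buf, old_stride, n):
    """Pack the live n*n block of a row-major matrix to the front of buf.

    Before: row r starts at r * old_stride, with unused entries after
    the first n entries of each row.
    After:  row r starts at r * n, with no unused memory between rows.
    Returns the new stride (n). The buffer itself is not shortened.
    """
    if n > old_stride:
        raise ValueError("live size cannot exceed the allocated row length")
    # Row 0 is already in place. Rows are moved in ascending order, so a
    # destination never overlaps the source of a row not yet moved.
    for r in range(1, n):
        src = r * old_stride
        dst = r * n
        buf[dst:dst + n] = buf[src:src + n]
    return n
```

After a call, any row pointers (here, row start offsets) would need to be recomputed as `r * n`.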