+ This method aligns detected peaks in different samples through a match score. This score is calculated
+ based on the mass spectrum and retention time of each peak and ranges of tolerance stipulated in the parameters
+ setup dialog.
+
Start by assigning
+ each item to a cluster, so that if you have N items, you now have N clusters,
+ each containing just one item. Let the distances (similarities) between the
+ clusters the same as the distances (similarities) between the items they contain.
+
Find the closest (most similar)
+ pair of clusters and merge them into a single cluster, so that now you have
+ one cluster less.
+
Compute distances (similarities)
+ between the new cluster and each of the old clusters.
+
Repeat steps 2 and 3 until all
+ items are clustered into a single cluster of size N. (*)
+
+
Step 3 can be done
+ in different ways, which is what distinguishes single-linkage from
+ complete-linkage and average-linkage clustering.
+ In single-linkage clustering (also called the connectedness
+ or minimum method), we consider the distance between one cluster and
+ another cluster to be equal to the shortest distance from any member
+ of one cluster to any member of the other cluster. If the data consist of similarities,
+ we consider the similarity between one cluster and another cluster to be equal
+ to the greatest similarity from any member of one cluster to any member
+ of the other cluster.
+ In complete-linkage clustering (also called the diameter or
+ maximum method), we consider the distance between one cluster and another
+ cluster to be equal to the greatest distance from any member of one cluster
+ to any member of the other cluster.
+ In average-linkage clustering, we consider the distance between one
+ cluster and another cluster to be equal to the average distance from
+ any member of one cluster to any member of the other cluster.
+ A variation on average-link clustering is the UCLUS method of R.
+ D'Andrade (1978) - https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html#dandrade - which uses the median distance, which is much more
+ outlier-proof than the average distance.
+
This kind of hierarchical
+ clustering is called agglomerative because it merges clusters iteratively.
+ There is also a divisive hierarchical clustering which does the reverse
+ by starting with all objects in one cluster and subdividing them into smaller
+ pieces. Divisive methods are not generally available, and rarely have been applied.
+
(*) Of course there
+ is no point in having all the N items grouped in a single cluster but, once
+ you have got the complete hierarchical tree, if you want k clusters you just
+ have to cut the k-1 longest links.
+
+
+
+
+
+
>> Available linkage criterion are:
+
+
Single
+
Complete
+
Average
+
+
+
+
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
+
Some commonly used linkage criteria between two sets of observations A and B are:
+
+
where d is the chosen metric.
+
+
Linkage methods:
+
+
+
+
+
+
+
MZmine GC: Specific considerations
+
+
>> How distance matrix between all pairs of observations is obtained
+
+
+
Compute a similarity score based on chemical likelihood only
+
Compute a similarity score based on RT likelihood only
+
Determine a combined weighted (=mixture) score based on a mixture of the above two scores
+
+
+ Giving:
+
+
1. Chemical similarity measurement (chemSimScore)
+
+
+
2. Retention time score
+ Retention time score is normalized relatively to the RT window tolerance provided by the user (rtScore = (1.0d - rtDiff / rtMaxDiff))
+
+
3. Mixture score
+ The final mixture score (score = (chemSimScore * mzWeight) + (rtScore * rtWeight)).
+
+
+
+
+
A square matrix is generated by comparing all the scores, for all the pairs of peaks.
+
+
+
>> Getting clusters from rooted binary tree
+
+The algorithm methods described above all lead to a unique binary tree where all items are clustered into a single cluster of size N (number of leafs).
+
+A final step is required to get the very final clusters list. A "cutoff" based on two criterion allows to split the single cluster into appropriate subgroups:
+
+
+
A cluster cannot contain more than one leaf per sample
+
A cluster cannot contain two leafs for which the distance (or mixture similarity score) is to low
+
+
+
Example with a 3 samples binary tree
+
+
+
+
+
+
+
Method parameters
+
+
+
Peak list name
+
Name of the new aligned peak list
+
m/z tolerance
+
This value sets the range, in terms of m/z, for possible peaks to be
+ aligned. Maximum allowed m/z difference
+
Weight for m/z
+
This is the assigned weight for m/z difference at the moment of match score calculation between peak rows.
+ In case of perfectly matching m/z values the score receives the complete weight.
+
Retention time tolerance type
+
Maximum RT difference can be defined either using absolute or relative value
+
Absolute RT tolerance
+
Maximum allowed absolute RT difference
+
Relative RT tolerance
+
Maximum allowed relative RT difference
+
Weight for RT
+
This is the assigned weight for RT difference at the moment of match score calculation between peak rows.
+ In case of perfectly matching RT values the score receives the complete weight.
+
+
Export dendrogram as TXT
+
Results in CDT + GTR files (See bellow how to visualize those files).
+
Dendrogram output text filename
+
Name of the resulting TXT file to write the clustering resulting dendrogram to. If the file already exists, it will be overwritten.
+
+
+
+
+
+
+
+
+ New aligned peak list showing peaks from 3 different samples
+
+
+
+
+
+ Clustering result can be exported into CDT+GTR format
+ The latter can be then browsed using common applications such as TreeView - https://sourceforge.net/projects/jtreeview/.
+
+
+
+
+
+
+
+
Inspiration taken from
+
+
An Optimal Peak Alignment For Comprehensive Two-Dimensional Gas Chromatography Mass Spectrometry Using Mixture Similarity Measure - 2013 [Seongho Kim, Aiqin Fang, Bing Wang, Jaesik Jeong, Xiang Zhang] -
+ https://louisville.edu/faculty/x0zhan17/papers/2011_mspa_bioinfor