An excellent general review of clustering techniques is given by Jain et al. (1999). In general, the goal of clustering is to partition an input data set into non-overlapping (disjoint) groups, based on a (possibly sparse) similarity matrix, such that within-group similarities are maximized while between-group similarities are minimized. Some algorithms can also cluster data into overlapping groups, in which each member is assigned a group membership score (e.g., a probability).
Clustering is a non-trivial task that involves multiple steps, such as data pre-processing (e.g., outlier removal), similarity/distance calculations between data points and between clusters, cluster detection, and validation. In particular, decisions need to be made about a similarity/distance measure, a clustering algorithm (e.g., agglomerative vs. divisive, hierarchical vs. flat), and a validation criterion and dataset. Each clustering method has pros and cons, which can be investigated on different benchmark datasets. For this purpose, many validation techniques have been developed in the data-mining domain (Halkidi et al., 2001; Meila, 2007).
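As a concrete illustration of these steps, here is a minimal sketch of such a pipeline in Python, assuming SciPy and scikit-learn are available; the toy two-blob dataset, the Euclidean metric, average linkage, and the cluster count are arbitrary illustrative choices, not prescriptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Toy benchmark: two Gaussian blobs with known "ground-truth" labels.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(5.0, 1.0, (50, 2))])
true_labels = np.repeat([0, 1], 50)

# Step 1: choose a similarity/distance measure (Euclidean here).
D = pdist(X, metric="euclidean")

# Step 2: choose a clustering algorithm (agglomerative, average linkage).
Z = linkage(D, method="average")

# Step 3: extract flat clusters and validate against the benchmark labels.
pred_labels = fcluster(Z, t=2, criterion="maxclust")
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))
```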
Clustering algorithms can also be classified according to how similarities/distances are computed between clusters (linkage criteria), namely single, average, and complete linkage. These criteria are implemented as functions that return the minimum (nearest neighbour, or single link), average (average link), or maximum (complete link) of all similarities between the data points of two clusters. In graph-theoretic terms, the type of linkage relates to the notions of closeness and connectivity between nodes in a graph: single-linkage and complete-linkage methods correspond to finding maximal connected subgraphs and maximal completely connected subgraphs (cliques), respectively. Whilst the former yields somewhat 'loose' clusters, the latter results in 'tight' clusters; the average-linkage method is an intermediate between the two.
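To make the graph-theoretic correspondence concrete, the following sketch (assuming SciPy) uses the fact that single-linkage clusters at a distance threshold t are exactly the connected components of the graph whose edges join points within distance t; the data and the threshold are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
D = squareform(pdist(X))  # full pairwise distance matrix
t = 1.0                   # illustrative distance threshold

# Build the graph with an edge wherever two points are within distance t;
# its connected components are the single-linkage clusters at threshold t.
adj = csr_matrix(D <= t)
n_clusters, labels = connected_components(adj, directed=False)
print(n_clusters, "clusters with sizes", np.bincount(labels))
```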
A major computational bottleneck in similarity-based clustering is the calculation of all-versus-all similarities, as this computation scales quadratically with the number of data points.
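A quick back-of-the-envelope sketch of that scaling: n points require n(n-1)/2 pairwise similarities, so doubling n roughly quadruples the work (the sizes below are arbitrary examples).

```python
# Number of pairwise comparisons grows quadratically with n.
for n in (1_000, 10_000, 100_000):
    pairs = n * (n - 1) // 2
    print(f"n = {n:>7,}: {pairs:>14,} pairwise comparisons")
```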
- Python: scipy.cluster.hierarchy
- C++/Python/R/Matlab bindings: Barnes-Hut t-SNE
- C: Markov Cluster Algorithm (MCL)
- Matlab: Spectral Clustering Toolbox
- Computational phylogenetics tools
- C: CompLearn/Normalized Compression Distance (NCD)
- Python: ETE Toolkit - analysis and visualization of trees
- SciPy Hierarchical Clustering and Dendrogram Tutorial
- Another SciPy Hierarchical Clustering Tutorial
- Optimizing Parallel Reduction in CUDA
- Theory on evaluating clustering results
- SciKit-Learn's manual on cluster evaluation
- Experiences with Hierarchical Clustering
- Matlab source and documentation
- Dresden image database
[1] Jain, A. K., Murty, M. N. and Flynn, P. J. (1999) Data clustering: a review, ACM Computing Surveys, 31, 264-323.
[2] van Dongen, S. (2000) Graph Clustering by Flow Simulation, PhD thesis, University of Utrecht, The Netherlands.
[3] van der Maaten, L. J. P. (2009) Feature Extraction from Visual Data, PhD thesis, Tilburg University, The Netherlands.
[4] Meila, M. (2007) Comparing clusterings: an information based distance, J Multivar Anal, 98, 873-895.
[5] Duda, R. O., Hart, P. E. and Stork, D. G. (2000) Pattern Classification (2nd edition), Wiley-Interscience, New York, USA.
[6] Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) On clustering validation techniques, J Intell Inf Syst, 17, 107-145.
[7] Zimek, A. (2009) Correlation clustering, ACM SIGKDD Explorations Newsletter, 11(1), 53-54.
[8] Cilibrasi, R. and Vitanyi, P. M. B. (2005) Clustering by compression, IEEE Transactions on Information Theory, 51, 1523-1545.