
Cluster analysis

An excellent general review of clustering techniques is given by Jain et al. (1999). In general, the goal of clustering is to partition an input data set into non-overlapping (disjoint) groups, based on a (sparse) similarity matrix, in such a way that the within-group similarities are maximized while the between-group similarities are minimized. Some algorithms can also cluster the data into overlapping groups, in which each member is assigned a group membership score (e.g., a probability).
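
A minimal sketch of the difference between a disjoint partition and overlapping (soft) memberships, using scikit-learn (listed under the tools below); the toy data, number of clusters and parameter values are illustrative assumptions.

```python
# Hard (disjoint) partitioning vs. soft (overlapping) clustering -- a sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two toy 2-D groups; real data, metrics and parameters will differ.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Hard clustering: every point is assigned to exactly one group.
hard_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# Soft clustering: every point gets a membership probability per group.
soft_probs = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard_labels[:3])          # one label per point
print(soft_probs[:3].round(3))  # one probability per point and group
```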

Clustering is a non-trivial task that involves multiple steps, such as data pre-processing (e.g., outlier handling), similarity/distance calculations between data points and between clusters, cluster detection, and validation. In particular, decisions need to be made about the similarity/distance measure, the clustering algorithm (e.g., agglomerative vs. divisive, hierarchical vs. flat) and the validation criterion and dataset. Each clustering method has pros and cons, which can be investigated on different benchmark datasets. To this end, many validation techniques have been developed in the data-mining domain (Halkidi et al., 2001; Meila, 2007).
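
These steps can be sketched in Python with scipy.cluster.hierarchy and scikit-learn (both listed under the tools below); the Euclidean metric, average linkage, fixed number of clusters and silhouette validation are example choices, not prescriptions.

```python
# Sketch of a typical workflow: distance measure -> clustering -> validation.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(4, 1, (30, 4))])  # toy data

dist = pdist(X, metric='euclidean')              # 1) choose a distance measure
Z = linkage(dist, method='average')              # 2) choose an algorithm/linkage
labels = fcluster(Z, t=2, criterion='maxclust')  # 3) cut the tree into 2 clusters

# 4) validate: the silhouette ranges from -1 (poor) to +1 (well separated).
print(silhouette_score(X, labels))
```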

Clustering algorithms can also be classified according to the way similarities/distances are computed between clusters (linkage criteria), namely single-, average- and complete-linkage. These criteria are implemented as functions that return the minimum (nearest neighbor or single link), the average (average link) or the maximum (complete link) of all similarities between the data points of two clusters. In graph-theoretic terms, the type of linkage relates to the concepts of closeness and connectivity between nodes in a graph: single-linkage and complete-linkage correspond to finding maximal connected subgraphs and maximal completely-connected subgraphs (cliques), respectively. Whilst the former yields somewhat 'loose' clusters, the latter results in 'tight' clusters; the average-linkage method is an intermediate between the two.
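
A short sketch of the three linkage functions on two small point clusters; cluster_distance is a hypothetical helper written only for illustration (scipy's linkage with method='single'/'average'/'complete' implements the same idea within a full hierarchical clustering).

```python
# Single, average and complete linkage return the minimum, mean and maximum
# of all pairwise distances between two clusters, respectively.
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(A, B, link='single'):
    """Illustrative inter-cluster distance under different linkage criteria."""
    d = cdist(A, B)          # all pairwise distances between clusters A and B
    if link == 'single':
        return d.min()       # nearest neighbours -> 'loose' clusters
    if link == 'complete':
        return d.max()       # farthest pair -> 'tight' clusters
    return d.mean()          # average link -> intermediate behaviour

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
for link in ('single', 'average', 'complete'):
    print(link, cluster_distance(A, B, link))   # 2.0, 3.0, 4.0
```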

One of the major computational bottlenecks in similarity-based clustering is the calculation of all-versus-all similarities, as this computation scales quadratically with the number of data points.
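
The quadratic growth can be made concrete with scipy's condensed distance matrix, which stores exactly n*(n-1)/2 entries; the array sizes below are arbitrary examples.

```python
# All-versus-all similarities grow as n*(n-1)/2, i.e. O(n^2) in the number
# of data points n.
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.RandomState(2).rand(1000, 16)   # 1000 toy points, 16 features
d = pdist(X)                                  # condensed distance matrix
print(len(d), 1000 * 999 // 2)                # both are 499500

for n in (10**4, 10**5, 10**6):
    print(n, n * (n - 1) // 2)                # number of pairwise computations
```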

Clustering tools and libraries:

Python: scipy.cluster.hierarchy

Python: scikit-learn

Python: HDBSCAN

C++/Python/R/Matlab bindings: Barnes-Hut t-SNE

C: Markov Cluster Algorithm (MCL)

C: Netclust

C: CLUTO

R packages

Java: ELKI

Java: Carrot2

Matlab: Spectral Clustering Toolbox

Computational phylogenetics tools

Spark MLlib

Misc tools and libraries:

C: CompLearn/Normalized Compression Distance (NCD)

Java: MINE statistics

Python: ETE Toolkit - analysis and visualization of trees

Tutorials

SciPy Hierarchical Clustering and Dendrogram Tutorial

Another SciPy Hierarchical Clustering Tutorial

Optimizing Parallel Reduction in Cuda

Evaluating the results of clustering

Theory on evaluating clustering results

SciKit-Learn's manual on cluster evaluation

Experiences with Hierarchical Clustering

Porting the Normalized Cross Correlation (NCC) Java Code to a GPU

Porting NCC to GPU

Common Image Source Identification Application

Matlab source and documentation
Dresden image database

Summary of Noise Analysis

References:

[1] Jain, A. K., Murty, M. N. and Flynn, P. J. (1999) Data clustering: a review, ACM Computing Surveys (CSUR), 31, 264-323.

[2] van Dongen, S. (2000) Graph Clustering by Flow Simulation, PhD thesis, University of Utrecht, The Netherlands.

[3] van der Maaten, L.J.P. (2009) Feature Extraction from Visual Data, PhD thesis, Tilburg University, The Netherlands.

[4] Meila, M. (2007) Comparing clusterings: an information based distance, J Multivar Anal 98, 873-895.

[5] Duda, R. O., Hart, P. E. and Stork, D. G. (2000) Pattern Classification (2nd Edition), Wiley-Interscience, New York, USA.

[6] Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) On Clustering Validation Techniques, J Intell Inf Syst 17, 107-145.

[7] Zimek, A. (2009) Correlation clustering, ACM SIGKDD Explorations Newsletter 11 (1), 53–54.

[8] Cilibrasi, R. and Vitanyi, P.M.B. (2005) Clustering by compression, IEEE Transactions on Information Theory, 51, 1523-1545.