This is the experiment code for our paper
X. Guo, S. Li. Distributed k-Clustering for Data with Heavy Noise. NIPS'18
- Python >= 3.4
numpy
>= 1.15 andscipy
>= 1.2.0scikit-learn
>= 17.0
To run the code, please see the kzcenter_exp.py
file as example.
- Data preprocessing
- (k,z)-center algorithm
- coreset
- (k,z)-median
- (k,z)-means
All the algorithm implementations are under clusterz/algs
:
-
kzcenter.py
: includes the implementation of our distributed (k,z)-center algorithm, the algorithm by Malkomes et al.[1], the centralized (k,z)-center algorithm by Charikar et al.[2], and the 2-approx k-center algorithm by Hochbaum and Shmoys[3] -
kz_clustering_from_others.py
: contains the implementation for some (k,z)-clustering algorithms proposed recently, including Guha et al.[4] and Chen et al.[5] -
kz_lp_clustering.py
: contains the implementation for our distributed (k,z)-median/means algorithm. Actually this implementation works for any L_p norms. -
kzmeans.py
: instantiate the distributed L_p clustering routine defined inkz_lp_clustering.py
with L_2 norm, thus obtain a distributed (k,z)-means routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7] -
kzmedian.py
: instantiate the distributed L_p clustering routine defined inkz_lp_clustering.py
with L_1 norm, thus obtain a distributed (k,z)-median routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7] -
misc.py
: utility functions for the classes inkzcenter.py
. Maily facilitate the ball-cover step. -
robust_facility_location.py
: centralized solver for the min-max k-clustering problem based on Anthony et al.[8]. Will be invoked by our distributed (k,z)-means/median algorithms as the final step.
[1] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen,Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. NIPS'15
[2] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. SODA'01
[3] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Math. Oper. Res., 1985
[4] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. SPAA'17
[5] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. NIPS'18
[6] Stuart P. Lloyd: Least squares quantization in PCM. IEEE Trans. Information Theory. 1982
[7] Sanjay Chawla and Aristides Gionis. k-means−−: A unified approach to clustering and outlier detection. ICDM'13
[8] Barbara M. Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure: Approximation algorithms for min-max location problems. Math. Oper. Res., 2010.