Skip to content

Latest commit

 

History

History
76 lines (51 loc) · 3.02 KB

README.md

File metadata and controls

76 lines (51 loc) · 3.02 KB

Distributed Clustering with Outliers

This is the experiment code for our paper

X. Guo, S. Li. Distributed k-Clustering for Data with Heavy Noise. NIPS'18

Requirement

  • Python >= 3.4
  • numpy >= 1.15 and scipy >= 1.2.0
  • scikit-learn >= 17.0

Testing

To run the code, please see the kzcenter_exp.py file as example.

Progress

  1. Data preprocessing
  2. (k,z)-center algorithm
  3. coreset
  4. (k,z)-median
  5. (k,z)-means

Summary

All the algorithm implementations are under clusterz/algs:

  • kzcenter.py: includes the implementation of our distributed (k,z)-center algorithm, the algorithm by Malkomes et al.[1], the centralized (k,z)-center algorithm by Charikar et al.[2], and the 2-approx k-center algorithm by Hochbaum and Shmoys[3]

  • kz_clustering_from_others.py: contains the implementation for some (k,z)-clustering algorithms proposed recently, including Guha et al.[4] and Chen et al.[5]

  • kz_lp_clustering.py: contains the implementation for our distributed (k,z)-median/means algorithm. Actually this implementation works for any L_p norms.

  • kzmeans.py: instantiate the distributed L_p clustering routine defined in kz_lp_clustering.py with L_2 norm, thus obtain a distributed (k,z)-means routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7]

  • kzmedian.py: instantiate the distributed L_p clustering routine defined in kz_lp_clustering.py with L_1 norm, thus obtain a distributed (k,z)-median routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7]

  • misc.py: utility functions for the classes in kzcenter.py. Maily facilitate the ball-cover step.

  • robust_facility_location.py: centralized solver for the min-max k-clustering problem based on Anthony et al.[8]. Will be invoked by our distributed (k,z)-means/median algorithms as the final step.

[1] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen,Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. NIPS'15

[2] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. SODA'01

[3] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Math. Oper. Res., 1985

[4] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. SPAA'17

[5] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. NIPS'18

[6] Stuart P. Lloyd: Least squares quantization in PCM. IEEE Trans. Information Theory. 1982

[7] Sanjay Chawla and Aristides Gionis. k-means−−: A unified approach to clustering and outlier detection. ICDM'13

[8] Barbara M. Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure: Approximation algorithms for min-max location problems. Math. Oper. Res., 2010.