
Implement outlier scores for HDBSCAN #71

Closed

azizkayumov opened this issue Jul 19, 2024 · 5 comments · Fixed by #73
Comments

@azizkayumov
Contributor

azizkayumov commented Jul 19, 2024

HDBSCAN's outlier score computation algorithm, GLOSH (Global-Local Outlier Score from Hierarchies), seems to be a great addition, as we are more interested in flat clustering results than in the clustering hierarchy in this library.

Python HDBSCAN has this feature, but its implementation seems to be a little buggy. Are you interested in adding this feature to HDBSCAN? I would like to open a PR for this, but I'm worried about code coverage for now, so let me test it a bit and open the PR if you're interested.

@msk
Collaborator

msk commented Jul 19, 2024

I'd be happy to review a PR for this feature. Don't be too concerned about code coverage at the initial PR stage. Feel free to open the PR when you're ready, and we can iterate on it.

azizkayumov added a commit to azizkayumov/petal-clustering that referenced this issue Jul 20, 2024
@azizkayumov
Contributor Author

First, let me briefly describe how outlier scores are computed in the HDBSCAN* paper: GLOSH (Global-Local Outlier Score from Hierarchies), Equation 8 in the paper.

To compute the outlier score of a data object x:

  1. Find the first cluster C that x gets attached to in the hierarchy:
    eps_x = the lowest eps at which x still belongs to C.
  2. Find the lowest eps at which C survives:
    eps_C = the lowest eps at which C or any of C's child clusters survives w.r.t. min_cluster_size.
  3. Then the outlier score of x is defined as:
    score(x) = 1 - eps_C / eps_x
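
For a quick sanity check with hypothetical numbers: if x first attaches to C at eps_x = 4.0 while C (or one of its child clusters) survives down to eps_C = 1.0, then score(x) = 1 - 1.0/4.0 = 0.75. A point deep inside C, whose eps_x is close to eps_C, gets a score close to 0, so scores lie between 0 and 1, with larger values meaning more outlying.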

In the source code, we work with density threshold values (lambdas) instead of eps values.
Since lambda = 1/eps, we get the following:

  • lambda_x = the highest lambda at which x still belongs to C.
  • lambda_C = the highest lambda at which C or any of C's child clusters survives w.r.t. min_cluster_size.

Finally, we get the following:
score(x) = 1 - lambda_x / lambda_C

For all data objects, we can compute outlier scores in O(n) time if we keep track of the lambda_C values while building the clustering hierarchy.
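
To make that single pass concrete, here is a rough Python sketch (not the PR's Rust code; the input names are hypothetical and assumed to be tracked while the hierarchy is built):

```python
import numpy as np

def glosh_scores(lambda_x, parent_cluster, lambda_C):
    """Hypothetical sketch of GLOSH, assuming the hierarchy bookkeeping is done.

    lambda_x[i]       -- highest lambda at which point i still belongs to its cluster
    parent_cluster[i] -- id of the first cluster point i gets attached to
    lambda_C[c]       -- highest lambda at which cluster c or any of its child
                         clusters survives (already propagated up the tree)
    """
    lambda_x = np.asarray(lambda_x, dtype=float)
    max_lambda = np.array([lambda_C[c] for c in parent_cluster], dtype=float)
    max_lambda[max_lambda == 0.0] = 1.0  # guard against division by zero
    return 1.0 - lambda_x / max_lambda   # one pass over the points: O(n)

# Example with made-up values for three points in one cluster (id 0):
print(glosh_scores([0.5, 2.0, 2.0], [0, 0, 0], {0: 2.0}))  # -> [0.75, 0.0, 0.0]
```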

@azizkayumov
Contributor Author

When computing the lambda_C value of a cluster C, Python HDBSCAN seems to ignore C's child clusters, so it computes buggy outlier scores. The following plot is the output of Python HDBSCAN.
Notice the locations of the outliers ranked 1 and 2: they lie between two big clusters, and even on their worst day they should be less of an outlier than, e.g., the data objects ranked 22 or 23.
(image: glosh_pyhdbscan)

By comparison, here is the Rust HDBSCAN's visualization of the rankings, which I hope are correctly computed. As data objects get closer to the clusters (w.r.t. min_cluster_size), they naturally become inliers:
(image: glosh_rusthdbscan)

@azizkayumov
Contributor Author

To reproduce Python HDBSCAN's bug, please run the following:

import hdbscan
import numpy as np
import matplotlib.pyplot as plt


# Step 1: Generate the data
data = []
centers = [[15, 37], [30, 68], [38, 85]]
radiuses = [5, 5, 1]
counts = [100, 100, 100]
np.random.seed(0)
for i in range(3):
    x0, y0 = centers[i]
    r = radiuses[i]
    n = counts[i]
    for j in range(n):
        x = np.random.normal(x0, r)
        y = np.random.normal(y0, r)
        data.append([x, y])
num_noises = 30 # adding some uniform noise
for i in range(num_noises):
    x = np.random.uniform(0, 100)
    y = np.random.uniform(0, 100)
    data.append([x, y])


# Step 2: Compute the outlier scores
k = 15
clusterer = hdbscan.HDBSCAN(
    alpha=1.0, 
    approx_min_span_tree=False,
    gen_min_span_tree=True,
    metric='euclidean', 
    min_cluster_size=k, 
    min_samples=k,
    allow_single_cluster=True,
    match_reference_implementation=True)
clusterer.fit(data)


# Step 3. Plot the data and the outlier scores
outlier_scores = clusterer.outlier_scores_
plt.scatter([x[0] for x in data], [x[1] for x in data], s=25, c=outlier_scores, cmap='viridis')
plt.colorbar()


# Step 4: Assign rankings and plot the top 30 outliers
indices = [i for i in range(len(outlier_scores))]
indices.sort(key=lambda x: outlier_scores[x])
last_30 = indices[-30:]
last_30 = reversed(last_30)
for i, idx in enumerate(last_30):
    plt.text(data[idx][0], data[idx][1], str(i+1), fontsize=10, color='black')
plt.title("PyHDBSCAN: outlier scores & rankings (top 30)")
plt.show()

I tried to generate a dataset similar to the example data illustrated in the HDBSCAN* paper (Figure 10) so that both the Python and Rust HDBSCANs can be run on the same dataset.
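
For a rough cross-check (my sketch, not part of the PR; it assumes pandas is available), one could recompute the scores from Python HDBSCAN's condensed tree while propagating lambda values up from child clusters, and compare them against clusterer.outlier_scores_ from the script above:

```python
import numpy as np

# Hypothetical cross-check: recompute GLOSH from the condensed tree of the
# clusterer fitted above, taking child clusters into account for lambda_C.
tree = clusterer.condensed_tree_.to_pandas()
n_points = len(data)

# lambda_x and the cluster each point falls out of.
lambda_x = np.zeros(n_points)
parent_of = np.zeros(n_points, dtype=int)
for _, row in tree[tree['child'] < n_points].iterrows():
    i = int(row['child'])
    lambda_x[i] = row['lambda_val']
    parent_of[i] = int(row['parent'])

# Per-cluster death lambda, then propagate the maximum up from child clusters
# (child cluster ids are larger than their parents', so iterating them in
# descending order covers whole subtrees).
death = tree.groupby('parent')['lambda_val'].max().to_dict()
cluster_rows = tree[tree['child'] >= n_points].sort_values('child', ascending=False)
for _, row in cluster_rows.iterrows():
    parent, child = int(row['parent']), int(row['child'])
    death[parent] = max(death.get(parent, 0.0), death.get(child, 0.0))

lambda_C = np.array([death[p] for p in parent_of])
scores = 1.0 - lambda_x / np.maximum(lambda_C, np.finfo(float).eps)
print(np.argsort(scores)[-30:][::-1])  # top-30 outlier indices for comparison
```

The upward propagation over child clusters is the part that Python HDBSCAN appears to skip.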

@azizkayumov
Contributor Author

The last commit includes a test case that can be visualized easily and on which Python HDBSCAN fails to rank the outliers correctly. Compared to that, the output of this PR ranks the outliers correctly:
(image: glosh_rusthdbscan)
The code coverage check is complaining about some missing parts; otherwise, please let me know if I can clarify anything further.

@msk msk closed this as completed in #73 Nov 29, 2024