
Implement outlier scores for HDBSCAN #71

Closed

azizkayumov opened this issue Jul 19, 2024 · 5 comments · Fixed by #73
Comments

@azizkayumov
Contributor

azizkayumov commented Jul 19, 2024

HDBSCAN's outlier score computation algorithm, GLOSH (Global-Local Outlier Score from Hierarchies), seems to be a great addition, as we are more interested in flat clustering results than in the clustering hierarchy in this library.

Python HDBSCAN has this feature, but its implementation seems to be a little buggy. Are you interested in adding this feature to HDBSCAN? I would like to open a PR for this, but I'm worried about code coverage for now, so let me test it a bit and open the PR if you're interested.

@msk
Collaborator

msk commented Jul 19, 2024

I'd be happy to review a PR for this feature. Don't be too concerned about code coverage at the initial PR stage. Feel free to open the PR when you're ready, and we can iterate on it.

azizkayumov added a commit to azizkayumov/petal-clustering that referenced this issue Jul 20, 2024
@azizkayumov
Contributor Author

First, let me briefly describe how outlier scores are computed in the HDBSCAN* paper: GLOSH (Global-Local Outlier Score from Hierarchies), Equation 8 in the paper.

To compute the outlier score of a data object x:

  1. Find the first cluster C that x gets attached to in the hierarchy:
    eps_x = the lowest eps at which x still belongs to C.
  2. Find the lowest eps at which C survives:
    eps_C = the lowest eps at which C or any of C's child clusters survives w.r.t. min_cluster_size.
  3. Then the outlier score of x is defined as:
    score(x) = 1 - eps_C / eps_x
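
For a quick sanity check with hypothetical numbers: if x first attaches to C at eps_x = 4.0 while C (or one of its child clusters) survives down to eps_C = 1.0, then score(x) = 1 - 1.0/4.0 = 0.75. A point deep inside C, whose eps_x is close to eps_C, gets a score close to 0, so scores lie between 0 and 1, with larger values meaning more outlying.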

In the source code, we work with density threshold values (lambdas) instead of eps values.
Since lambda = 1/eps, we get the following:

  • lambda_x = the highest lambda at which x still belongs to C.
  • lambda_C = the highest lambda at which C or any of C's child clusters survives w.r.t. min_cluster_size.

Finally, we get the following:
score(x) = 1 - lambda_x / lambda_C

For all data objects, we can compute outlier scores in O(n) time if we keep track of the lambda_C values while building the clustering hierarchy.
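
To make that single pass concrete, here is a rough Python sketch (not the PR's Rust code; the input names are hypothetical and assumed to be tracked while the hierarchy is built):

```python
import numpy as np

def glosh_scores(lambda_x, parent_cluster, lambda_C):
    """Hypothetical sketch of GLOSH, assuming the hierarchy bookkeeping is done.

    lambda_x[i]       -- highest lambda at which point i still belongs to its cluster
    parent_cluster[i] -- id of the first cluster point i gets attached to
    lambda_C[c]       -- highest lambda at which cluster c or any of its child
                         clusters survives (already propagated up the tree)
    """
    lambda_x = np.asarray(lambda_x, dtype=float)
    max_lambda = np.array([lambda_C[c] for c in parent_cluster], dtype=float)
    max_lambda[max_lambda == 0.0] = 1.0  # guard against division by zero
    return 1.0 - lambda_x / max_lambda   # one pass over the points: O(n)

# Example with made-up values for three points in one cluster (id 0):
print(glosh_scores([0.5, 2.0, 2.0], [0, 0, 0], {0: 2.0}))  # -> [0.75, 0.0, 0.0]
```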

@azizkayumov
Contributor Author

When computing the lambda_C value of a cluster C, Python HDBSCAN seems to ignore C's child clusters, so it computes buggy outlier scores. The following plot is the output of Python HDBSCAN.
Notice the locations of the outliers ranked 1 and 2: they lie between two big clusters, and even on their worst day they should be less of an outlier than, e.g., the data objects ranked 22 or 23.
(image: glosh_pyhdbscan)

By comparison, here is the Rust HDBSCAN's visualization of the rankings, which I hope are correctly computed. As data objects get closer to the clusters (w.r.t. min_cluster_size), they naturally become inliers:
(image: glosh_rusthdbscan)

@azizkayumov
Contributor Author

To reproduce Python HDBSCAN's bug, please run the following:

import hdbscan
import numpy as np
import matplotlib.pyplot as plt


# Step 1: Generate the data
data = []
centers = [[15, 37], [30, 68], [38, 85]]
radiuses = [5, 5, 1]
counts = [100, 100, 100]
np.random.seed(0)
for i in range(3):
    x0, y0 = centers[i]
    r = radiuses[i]
    n = counts[i]
    for j in range(n):
        x = np.random.normal(x0, r)
        y = np.random.normal(y0, r)
        data.append([x, y])
num_noises = 30 # adding some uniform noise
for i in range(num_noises):
    x = np.random.uniform(0, 100)
    y = np.random.uniform(0, 100)
    data.append([x, y])


# Step 2: Compute the outlier scores
k = 15
clusterer = hdbscan.HDBSCAN(
    alpha=1.0, 
    approx_min_span_tree=False,
    gen_min_span_tree=True,
    metric='euclidean', 
    min_cluster_size=k, 
    min_samples=k,
    allow_single_cluster=True,
    match_reference_implementation=True)
clusterer.fit(data)


# Step 3. Plot the data and the outlier scores
outlier_scores = clusterer.outlier_scores_
plt.scatter([x[0] for x in data], [x[1] for x in data], s=25, c=outlier_scores, cmap='viridis')
plt.colorbar()


# Step 4: Assign rankings and plot the top 30 outliers
indices = [i for i in range(len(outlier_scores))]
indices.sort(key=lambda x: outlier_scores[x])
last_30 = indices[-30:]
last_30 = reversed(last_30)
for i, idx in enumerate(last_30):
    plt.text(data[idx][0], data[idx][1], str(i+1), fontsize=10, color='black')
plt.title("PyHDBSCAN: outlier scores & rankings (top 30)")
plt.show()

I tried to generate a dataset similar to the example data illustrated in the HDBSCAN* paper (Figure 10) so that both the Python and Rust HDBSCANs can be run on the same dataset.
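
For a rough cross-check (my sketch, not part of the PR; it assumes pandas is available), one could recompute the scores from Python HDBSCAN's condensed tree while propagating lambda values up from child clusters, and compare them against clusterer.outlier_scores_ from the script above:

```python
import numpy as np

# Hypothetical cross-check: recompute GLOSH from the condensed tree of the
# clusterer fitted above, taking child clusters into account for lambda_C.
tree = clusterer.condensed_tree_.to_pandas()
n_points = len(data)

# lambda_x and the cluster each point falls out of.
lambda_x = np.zeros(n_points)
parent_of = np.zeros(n_points, dtype=int)
for _, row in tree[tree['child'] < n_points].iterrows():
    i = int(row['child'])
    lambda_x[i] = row['lambda_val']
    parent_of[i] = int(row['parent'])

# Per-cluster death lambda, then propagate the maximum up from child clusters
# (child cluster ids are larger than their parents', so iterating them in
# descending order covers whole subtrees).
death = tree.groupby('parent')['lambda_val'].max().to_dict()
cluster_rows = tree[tree['child'] >= n_points].sort_values('child', ascending=False)
for _, row in cluster_rows.iterrows():
    parent, child = int(row['parent']), int(row['child'])
    death[parent] = max(death.get(parent, 0.0), death.get(child, 0.0))

lambda_C = np.array([death[p] for p in parent_of])
scores = 1.0 - lambda_x / np.maximum(lambda_C, np.finfo(float).eps)
print(np.argsort(scores)[-30:][::-1])  # top-30 outlier indices for comparison
```

The upward propagation over child clusters is the part that Python HDBSCAN appears to skip.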

@azizkayumov
Contributor Author

The last commit includes a test case that can be visualized easily and on which Python HDBSCAN fails to rank the outliers correctly. Compared to that, the output of this PR ranks the outliers correctly:
(image: glosh_rusthdbscan)
The code coverage check is complaining about some missing parts; otherwise, please let me know if I can clarify anything further.

@msk msk closed this as completed in #73 Nov 29, 2024