Implement outlier scores for HDBSCAN #71
Comments
I'd be happy to review a PR for this feature. Don't be too concerned about code coverage at the initial PR stage. Feel free to open the PR when you're ready, and we can iterate on it.
First, let me briefly describe how outlier scores are computed in the HDBSCAN* paper: the method is GLOSH (Global-Local Outlier Score from Hierarchies), given as Equation 8 in the paper below. To compute the outlier score of a data object
In the source code, we work with density threshold values (lambdas) instead of eps values.
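As a sketch of how the score looks in terms of lambdas, assuming the usual convention $\lambda = 1/\varepsilon$ and the GLOSH form from Equation 8 of the HDBSCAN* paper (the symbols $\lambda(x_i)$ and $\lambda_{\max}(x_i)$ are notation introduced here for illustration, not names from the source code):

```latex
\mathrm{GLOSH}(x_i)
  = 1 - \frac{\varepsilon_{\max}(x_i)}{\varepsilon(x_i)}
  = 1 - \frac{\lambda(x_i)}{\lambda_{\max}(x_i)},
\qquad \lambda = \frac{1}{\varepsilon}
```

Here $\lambda(x_i)$ is the density threshold at which $x_i$ leaves its cluster and becomes noise, and $\lambda_{\max}(x_i)$ is the highest density threshold at which that cluster or any of its subclusters still exists. Because $\lambda(x_i) \le \lambda_{\max}(x_i)$, the score lies in $[0, 1)$, with values near 1 indicating stronger outliers.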
Finally, we get the following: For all data objects, we can compute outlier scores in O(n) time, if we keep track of
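A minimal Python sketch of one way to get all scores in linear time, under stated assumptions: the score is taken as `1 - lambda / lambda_max`, each point records the lambda at which it leaves its cluster, and cluster ids are numbered so that every child id is larger than its parent's. The function name `glosh_scores` and its argument layout are hypothetical and are not taken from this PR or from Python HDBSCAN.

```python
def glosh_scores(point_lambda, point_cluster, parent):
    """Hypothetical sketch of GLOSH scores computed from lambda values.

    point_lambda[i]  : lambda at which point i leaves its cluster
    point_cluster[i] : id of the cluster that point i leaves
    parent[c]        : parent cluster id of c (root is 0, parent[0] = -1)
    Assumes parent[c] < c for every non-root cluster.
    """
    n_clusters = len(parent)

    # Pass 1: largest lambda among the points attached directly to each cluster.
    lambda_max = [0.0] * n_clusters
    for lam, c in zip(point_lambda, point_cluster):
        lambda_max[c] = max(lambda_max[c], lam)

    # Pass 2: push maxima up the tree, visiting children before their parents,
    # so each cluster ends with the max over itself and all its descendants.
    for c in range(n_clusters - 1, 0, -1):
        p = parent[c]
        lambda_max[p] = max(lambda_max[p], lambda_max[c])

    # Pass 3: one division per point.
    scores = []
    for lam, c in zip(point_lambda, point_cluster):
        top = lambda_max[c]
        scores.append(0.0 if top == 0.0 else 1.0 - lam / top)
    return scores


# Root cluster 0 with two child clusters 1 and 2.
scores = glosh_scores(
    point_lambda=[0.1, 1.0, 2.0, 0.5, 4.0],
    point_cluster=[0, 1, 1, 2, 2],
    parent=[-1, 0, 0],
)
```

Each pass touches every point or every cluster once, so the total cost is linear in the number of points plus clusters. The point attached to the root is compared against the densest descendant cluster, which is what lets a global outlier score close to 1.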
To reproduce the Python HDBSCAN bug, please run the following:
I tried to generate a dataset similar to the example data illustrated in the HDBSCAN* paper (Figure 10), so that both the Python and Rust HDBSCAN implementations can be run on the same dataset.
The last commit includes a test case that is easy to visualize and on which Python HDBSCAN fails to rank the outliers correctly. In the output of this PR, by contrast, the outliers are ranked correctly:
HDBSCAN's outlier score computation algorithm, GLOSH (Global-Local Outlier Score from Hierarchies), seems like a great addition, since in this library we are more interested in flat clustering results than in the clustering hierarchy.
Python HDBSCAN has this feature, but its implementation seems to be a little buggy. Are you interested in adding this feature to HDBSCAN? I would like to open a PR for this, but I'm worried about code coverage for now, so let me test it a bit first and then open a PR if you're interested.