Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branch detection updates #667

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

JelmerBot
Copy link
Contributor

This PR solves #660, adding labels and probability parameters to BranchDetector.fit() that override the input HDBSCAN object. Cases where overridden clusters form multiple connected components in the minimum spanning tree are detected. The component labels are returned as branch labels in that case. The condensed and linkage trees for those clusters are set to None, allowing scripts to detect what happened.

While working on this PR, I noticed that the branching code could be simplified extensively. This PR revert some of the changes I made when I introduced the branch detection code. I also found and fixed some issues with the hierarchy simplification code that applies a persistence threshold and added a persistence threshold parameter to the clustering code.

Finally, I made small changes in _hdbscan_boruvka.pyx to expose the computed core distances and neighbours. This allows me to use the implementation in another project.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Comment on lines +6 to +10
class BranchDetectionData(object):
"""Input data for branch detection functionality.

Recreates and caches internal data structures from the clustering stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved BranchDetectionData to a new file so branches.py can import hdbscan_.py without cyclical imports.

Comment on lines -1218 to +1245
self._finite_index = get_finite_row_indices(X)
clean_data = X[self._finite_index]
finite_index = get_finite_row_indices(X)
clean_data = X[finite_index]
internal_to_raw = {
x: y for x, y in zip(range(len(self._finite_index)), self._finite_index)
x: y for x, y in zip(range(len(finite_index)), finite_index)
}
outliers = list(set(range(X.shape[0])) - set(self._finite_index))
outliers = list(set(range(X.shape[0])) - set(finite_index))
Copy link
Contributor Author

@JelmerBot JelmerBot Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now recover the finite index from the condensed tree, so finite_index does not have to be stored explicitly anymore. This reverts changes I made when I introduced the branch detection code.

assert_array_almost_equal,
assert_raises,
assert_array_almost_equal
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_raises gives import error on CI/CD. I replaced it with pytest.raises in all tests.

@lmcinnes
Copy link
Collaborator

This looks great. Let me know when you are ready to have it merged.

@JelmerBot JelmerBot marked this pull request as ready for review January 3, 2025 11:24
@JelmerBot
Copy link
Contributor Author

I think this is ready now. It contains breaking name changes for the BranchDetector, but that makes the naming consistent across repositories and these names make more sense to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants