Branch detection updates #667
base: master
class BranchDetectionData(object):
    """Input data for branch detection functionality.

    Recreates and caches internal data structures from the clustering stage.

Moved `BranchDetectionData` to a new file so `branches.py` can import `hdbscan_.py` without circular imports.
-        self._finite_index = get_finite_row_indices(X)
-        clean_data = X[self._finite_index]
+        finite_index = get_finite_row_indices(X)
+        clean_data = X[finite_index]
         internal_to_raw = {
-            x: y for x, y in zip(range(len(self._finite_index)), self._finite_index)
+            x: y for x, y in zip(range(len(finite_index)), finite_index)
         }
-        outliers = list(set(range(X.shape[0])) - set(self._finite_index))
+        outliers = list(set(range(X.shape[0])) - set(finite_index))
I now recover the finite index from the condensed tree, so `finite_index` no longer has to be stored explicitly. This reverts changes I made when I introduced the branch detection code.
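As a rough illustration of the bookkeeping in the diff above, the sketch below builds the finite-row index, the internal-to-raw mapping, and the outlier list for a toy array. `finite_row_indices` is a hypothetical stand-in written for this example; it is assumed to behave like `get_finite_row_indices`, which is not shown in this PR excerpt.

```python
import numpy as np


def finite_row_indices(X):
    """Hypothetical stand-in: indices of rows that contain only finite values."""
    return np.where(np.isfinite(X).all(axis=1))[0]


# Toy data: rows 1 and 3 contain NaN or inf and are treated as outliers.
X = np.array(
    [[0.0, 1.0], [np.nan, 2.0], [3.0, 4.0], [np.inf, 5.0], [6.0, 7.0]]
)

finite_index = finite_row_indices(X)
clean_data = X[finite_index]

# Map positions in clean_data back to row numbers in the raw input.
internal_to_raw = {i: int(raw) for i, raw in enumerate(finite_index)}

# Rows that were dropped because they contain non-finite values.
outliers = sorted(set(range(X.shape[0])) - set(finite_index.tolist()))
```

Because the mapping is derived entirely from `finite_index`, it can be rebuilt whenever it is needed rather than cached on the object.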
-    assert_array_almost_equal,
-    assert_raises,
+    assert_array_almost_equal
`assert_raises` gives an import error on CI/CD. I replaced it with `pytest.raises` in all tests.
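For readers unfamiliar with the replacement, here is a minimal sketch of the `pytest.raises` pattern. The validator `check_min_cluster_size` and its error message are hypothetical and exist only for this illustration; they are not taken from the test suite in this PR.

```python
import pytest


def check_min_cluster_size(value):
    """Hypothetical validator used only for this illustration."""
    if value < 2:
        raise ValueError("min_cluster_size must be at least 2")
    return value


def test_rejects_small_min_cluster_size():
    # The block passes only if ValueError is raised inside it and the
    # message matches the given regular expression.
    with pytest.raises(ValueError, match="at least 2"):
        check_min_cluster_size(1)
```

The context-manager form keeps the call under test inline, and `match` also checks the message.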
4dfbe7b to 95a70f3
95a70f3 to 3ebc8c5
This looks great. Let me know when you are ready to have it merged.
I think this is ready now. It contains breaking name changes for the
This PR solves #660, adding `labels` and `probability` parameters to `BranchDetector.fit()` that override the input HDBSCAN object. Cases where overridden clusters form multiple connected components in the minimum spanning tree are detected. The component labels are returned as branch labels in that case. The condensed and linkage trees for those clusters are set to `None`, allowing scripts to detect what happened.

While working on this PR, I noticed that the branching code could be simplified extensively. This PR reverts some of the changes I made when I introduced the branch detection code. I also found and fixed some issues with the hierarchy simplification code that applies a persistence threshold, and added a persistence threshold parameter to the clustering code.
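The connected-component check for overridden clusters can be sketched roughly as follows. This is not the code from the PR: the function name `component_labels`, the `(point, point)` edge layout, and the union-find approach are all assumptions made for illustration. It keeps only the minimum spanning tree edges whose endpoints both belong to the cluster and then labels the resulting components.

```python
import numpy as np


def component_labels(mst_edges, labels, cluster):
    """Label the connected components of one cluster within the MST.

    mst_edges: iterable of (point, point) index pairs (assumed layout).
    labels:    per-point cluster labels, possibly overridden by the user.
    Returns (n_components, array with -1 for points outside the cluster).
    """
    labels = np.asarray(labels)
    members = np.flatnonzero(labels == cluster)
    parent = {int(p): int(p) for p in members}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only edges with both endpoints inside the cluster connect its points.
    for a, b in mst_edges:
        a, b = int(a), int(b)
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Number the components in order of first appearance.
    roots = {}
    out = np.full(labels.shape[0], -1, dtype=int)
    for p in members:
        out[p] = roots.setdefault(find(int(p)), len(roots))
    return len(roots), out


# A chain-shaped MST over six points; the overridden labels split cluster 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
n_split, split = component_labels(edges, [0, 0, 1, 0, 0, 1], cluster=0)
n_whole, whole = component_labels(edges, [0, 0, 0, 1, 1, 1], cluster=0)
```

When the count is greater than one, the component labels can stand in for branch labels, which matches the behaviour described above.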
Finally, I made small changes in `_hdbscan_boruvka.pyx` to expose the computed core distances and neighbours. This allows me to use the implementation in another project.
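To make "core distances and neighbours" concrete, here is a brute-force NumPy sketch. It is not the Boruvka implementation and does not use its API. The function name is hypothetical, and the sketch assumes Euclidean distance and that each point counts as its own nearest neighbour, so the core distance is the distance to the `min_samples`-th entry of the sorted neighbour list.

```python
import numpy as np


def core_distances_and_neighbours(X, min_samples):
    """Brute-force core distances and nearest-neighbour indices (sketch)."""
    # Full pairwise Euclidean distance matrix; fine for small inputs only.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Indices of the min_samples closest points, the point itself included.
    neighbours = np.argsort(dist, axis=1, kind="stable")[:, :min_samples]
    knn_dist = np.take_along_axis(dist, neighbours, axis=1)

    # Core distance: distance to the furthest of those neighbours.
    return knn_dist[:, -1], neighbours


X = np.array([[0.0], [1.0], [3.0], [7.0]])
core, neighbours = core_distances_and_neighbours(X, min_samples=2)
```

Exposing both arrays lets downstream code reuse the neighbour search instead of recomputing it.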