add branch detection functionality #32
Conversation
@numba.njit(locals={"i": numba.types.int64}) | ||
def merge_components(disjoint_set, candidate_neighbors, candidate_neighbor_distances, point_components): | ||
component_edges = {np.int64(0): (np.int64(0), np.int64(1), np.float32(0.0)) for i in range(0)} | ||
|
||
@numba.njit(locals={"parent": numba.types.int32}) | ||
def select_components(candidate_distances, candidate_neighbors, point_components): | ||
component_edges = {np.int64(0): (np.int32(0), np.int32(1), np.float32(0.0)) for i in range(0)} | ||
|
||
# Find the best edges from each component | ||
for i in range(candidate_neighbors.shape[0]): | ||
from_component = np.int64(point_components[i]) | ||
for parent, (distance, neighbor, from_component) in enumerate( | ||
zip(candidate_distances, candidate_neighbors, point_components) | ||
): | ||
if from_component in component_edges: | ||
if candidate_neighbor_distances[i] < component_edges[from_component][2]: | ||
component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i]) | ||
if distance < component_edges[from_component][2]: | ||
component_edges[from_component] = (parent, neighbor, distance) | ||
else: | ||
component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i]) | ||
component_edges[from_component] = (parent, neighbor, distance) | ||
|
||
return component_edges |
I split `merge_components` into `select_components` and `merge_components` so I can re-use the latter part to compute the minimum spanning tree of a kNN–MST union graph. That process needs to reject invalid neighbour values within the `select_components` part.
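For illustration, here is a minimal, self-contained sketch of the kind of merge step this split enables, using a plain-Python union-find rather than the project's numba disjoint-set helpers (names and layout here are illustrative, not the PR's actual code):

```python
import numpy as np

def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_components(parent, component_edges):
    # component_edges maps component id -> (point, neighbor, distance),
    # as produced by a select_components-style pass.
    rows = []
    for point, neighbor, distance in component_edges.values():
        a, b = find(parent, point), find(parent, neighbor)
        if a != b:            # skip edges whose endpoints were already merged this round
            parent[a] = b     # union the two components
            rows.append((point, neighbor, distance))
    return np.array(rows, dtype=np.float64)

# One merge round on four points forming two components {0, 1} and {2, 3}:
parent = [0, 0, 2, 2]
edges = merge_components(parent, {0: (1, 2, 0.5), 2: (3, 0, 0.5)})
print(edges)  # only one of the two symmetric edges survives: [[1. 2. 0.5]]
```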
-def update_component_vectors(tree, disjoint_set, node_components, point_components):
+def update_point_components(disjoint_set, point_components):
     for i in numba.prange(point_components.shape[0]):
         point_components[i] = ds_find(disjoint_set, np.int32(i))


+@numba.njit()
+def update_node_components(tree, node_components, point_components):
Similarly, I split `update_component_vectors` into `update_point_components` and `update_node_components` so I can re-use the former part.
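Presumably the original behaviour is then just the composition of the two halves, mirroring the call sequence in the Borůvka loop below (a sketch, not code from the PR):

```python
def update_component_vectors(tree, disjoint_set, node_components, point_components):
    # Recompose the old behaviour from the two re-usable halves.
    update_point_components(disjoint_set, point_components)
    update_node_components(tree, node_components, point_components)
```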
+    edges = np.vstack(edges)
Refactor to call `vstack` only once, avoiding multiple allocations.
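A toy comparison of the two patterns (illustrative arrays only): stacking inside the loop copies the accumulated array on every iteration, while collecting chunks and stacking once copies each row a single time:

```python
import numpy as np

chunks = [np.random.rand(3, 3) for _ in range(4)]

# Quadratic copying: every vstack reallocates and copies everything so far.
edges = np.empty((0, 3))
for chunk in chunks:
    edges = np.vstack((edges, chunk))

# Linear copying: append to a list, stack once at the end.
edges = np.vstack(chunks)
```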
fast_hdbscan/boruvka.py (Outdated)
-    edges = initialize_boruvka_from_knn(neighbors, distances, core_distances, components_disjoint_set)
-    update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
-    while n_components > 1:
+    edges = [np.empty((0, 3), dtype=np.float64) for _ in range(0)]
+    new_edges = initialize_boruvka_from_knn(neighbors, core_distances, components_disjoint_set)
+    while True:
+        edges.append(new_edges)
+        update_point_components(components_disjoint_set, point_components)
+        update_node_components(tree, node_components, point_components)
+        if np.unique(point_components).shape[0] == 1:
+            break
+
         candidate_distances, candidate_indices = boruvka_tree_query(tree, node_components, point_components,
                                                                     core_distances)
-        new_edges = merge_components(components_disjoint_set, candidate_indices, candidate_distances, point_components)
-        update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
-        edges = np.vstack((edges, new_edges))
-        n_components = np.unique(point_components).shape[0]
+        component_edges = select_components(candidate_distances, candidate_indices, point_components)
+        new_edges = merge_components(components_disjoint_set, component_edges)
Refactor to avoid duplicate lines.
+def empty_condensed_tree():
+    parents = np.empty(shape=0, dtype=np.intp)
+    others = np.empty(shape=0, dtype=np.float32)
+    return CondensedTree(parents, parents, others, others)
Helper function used to signal that an overridden cluster contains multiple connected components.
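A caller can then recognise the sentinel by its zero-length arrays (a sketch assuming the `parent` field naming used elsewhere in this file, not code from the PR):

```python
tree = empty_condensed_tree()
if tree.parent.shape[0] == 0:
    ...  # overridden cluster spans multiple connected components
```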
fast_hdbscan/cluster_trees.py (Outdated)
-def extract_leaves(condensed_tree, allow_single_cluster=True):
-    n_nodes = condensed_tree.parent.max() + 1
-    n_points = condensed_tree.parent.min()
-    leaf_indicator = np.ones(n_nodes, dtype=np.bool_)
-    leaf_indicator[:n_points] = False
-
-    for parent, child_size in zip(condensed_tree.parent, condensed_tree.child_size):
-        if child_size > 1:
-            leaf_indicator[parent] = False
-
-    return np.nonzero(leaf_indicator)[0]
-
-
+def extract_leaves(cluster_tree, n_points):
+    n_nodes = cluster_tree.child.max() + 1
+    leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)
+    leaf_indicator[cluster_tree.parent - n_points] = False
+    return np.nonzero(leaf_indicator)[0] + n_points
Simplifies the function and saves some memory.
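A toy walk-through of the simplified version (made-up numbers): with `n_points = 5` and a cluster tree whose only split is node 5 into nodes 6 and 7, every cluster node that never appears as a parent is a leaf:

```python
import numpy as np
from collections import namedtuple

ClusterTree = namedtuple("ClusterTree", ["parent", "child"])  # illustrative stand-in

n_points = 5
tree = ClusterTree(parent=np.array([5, 5]), child=np.array([6, 7]))

n_nodes = tree.child.max() + 1                                # node ids 0..7
leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)  # nodes 5, 6, 7
leaf_indicator[tree.parent - n_points] = False                # node 5 is a parent
print(np.nonzero(leaf_indicator)[0] + n_points)               # [6 7]
```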
It would be helpful to keep the same API on this if possible, as I use it in other projects.
OK, I added the cluster-tree version alongside the condensed-tree version.
-        eps = 1 / cluster_tree.lambda_val[cluster_tree.child == cluster][0]
-        if eps < min_persistence:
+        idx = np.searchsorted(cluster_tree.child, cluster)
+        death_eps = 1 / cluster_tree.lambda_val[idx]
+        if death_eps < min_epsilon:
             if cluster not in processed:
-                parent = traverse_upwards(cluster_tree, min_persistence, root, cluster)
+                parent = traverse_upwards(cluster_tree, min_epsilon, root, cluster)
                 selected.append(parent)
-                processed |= segments_in_branch(cluster_tree, parent)
+                processed |= segments_in_branch(parents, children, parent)
Rename persistence to epsilon to avoid confusion: the threshold is compared against `1 / lambda_val`, which is a distance (epsilon) scale rather than a persistence.
-def segments_in_branch(cluster_tree, segment):
+def segments_in_branch(parents, children, segment):
     # only way to create a typed empty set
-    result = {np.intp(0)}
+    child_set = {np.int64(0)}
+    result = {np.int64(0)}
     result.clear()
     to_process = {segment}

     while len(to_process) > 0:
         result |= to_process
-        to_process = set(cluster_tree.child[
-            in_set_parallel(cluster_tree.parent, to_process)
-        ])
+
+        child_set.clear()
+        for segment in to_process:
+            idx = np.searchsorted(parents, segment)
+            if idx >= len(parents):
+                continue
+            child_set.add(children[idx])
+            child_set.add(children[idx + 1])
+
+        to_process.clear()
+        to_process |= child_set
Sequential binary search rather than parallel linear searches.
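For context, a toy illustration of the lookup pattern (assuming, as the loop above implies, that `parents` is sorted with each parent occupying two consecutive rows, one per child; the arrays are made up):

```python
import numpy as np

parents = np.array([5, 5, 6, 6])         # sorted, two rows per parent
children = np.array([6, 7, 8, 9])

idx = np.searchsorted(parents, 6)        # leftmost row for parent 6 -> 2
print(children[idx], children[idx + 1])  # 8 9
```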
-    sklearn_tree = KDTree(data)
-    numba_tree = kdtree_to_numba(sklearn_tree)
-    edges = parallel_boruvka(
-        numba_tree,
+    minimum_spanning_tree, neighbors, core_distances = compute_minimum_spanning_tree(
+        data,
Puts the MST construction code in a function.
-    labels, p, ltree, ctree, mtree = fast_hdbscan(X, return_trees=True)
+    labels, p, ltree, ctree, mtree, neighbors, cdists = fast_hdbscan(X, return_trees=True)
I now also return the neighbours and core distances from `fast_hdbscan`, but only when `return_trees=True`. Would that be considered a breaking API change? There was only a single call site in the tests that needed to be updated.
No, I don't think that's too serious, as long as the tests are updated and the documentation gets updated to explain how it works now.
Let me know when this is ready for a full review.
@@ -486,7 +502,7 @@ def unselect_below_node(node, cluster_tree, selected_clusters):

 @numba.njit(fastmath=True)
 def eom_recursion(node, cluster_tree, node_scores, node_sizes, selected_clusters, max_cluster_size):
-    current_score = node_scores[node]
+    current_score = max(node_scores[node], 0.0)  # floating point errors can make score negative!
Fixes a dictionary lookup error when assigning labels in cases where EOM selects only one child of the root because the other child has a negative score due to floating point errors.
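A quick illustration of the rounding behaviour that motivates the clamp (standard IEEE-754 floats, not project code): a mathematically zero score can come out slightly negative depending on the order of operations.

```python
a = (0.1 + 0.2) - 0.3
b = 0.3 - (0.2 + 0.1)
print(a, b)         # 5.551115123125783e-17 -5.551115123125783e-17
print(max(b, 0.0))  # 0.0 -- the clamp restores the intended value
```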
-        clean_data_labels = y
-
-        if self.semi_supervised:
-            clean_data_labels = y[finite_index]
+        clean_data_labels = y[finite_index] if self.semi_supervised else None
+        sample_weight = sample_weight[finite_index] if sample_weight is not None else None
`sample_weight` was not updated with `finite_index` before.
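A toy illustration of the mismatch this fixes (made-up arrays; the `finite_index` computation here is illustrative): without the filter, the weights stop lining up with the rows kept by `finite_index`:

```python
import numpy as np

X = np.array([[0.0], [np.nan], [1.0]])
sample_weight = np.array([1.0, 2.0, 3.0])

finite_index = np.where(np.isfinite(X).all(axis=1))[0]  # [0, 2]
clean_data = X[finite_index]                            # 2 rows remain

sample_weight = sample_weight[finite_index]             # [1.0, 3.0], aligned again
```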
This should also be ready now. I added a "for developers" section to the docs trying to explain how the branch detection functionality is built, but I am not sure about the text right now. Feel free to improve it or exclude it from the docs if it does not belong there.
Thanks for all the work. I especially appreciate the effort to make internals re-usable (which was one of the intents of this project -- something a little cleaner and more decomposable to work with). Hopefully you can make use of this code accordingly. And the "for developers" documentation looks good. I think it outlines the right entry points for re-use.
I found some time to implement the branch detection functionality for this repository as well. This PR adds two new files, `branches.py` and `core_graph.py`. The latter implements re-usable functions that compute hdbscan-like clusters from a kNN–MST union graph with edges weighted by given data-point lens values. The former uses those functions to compute branches with a per-cluster eccentricity lens.

Like #667 on the Cython implementation, this PR adds a persistence threshold to the hdbscan interface and supports overriding the cluster labels in the `BranchDetector`. In addition, I simplified a couple of functions without changing their behaviour. Finally, I added a `compute_minimum_spanning_tree` entry point so future external projects can use the sequence `compute_minimum_spanning_tree`, compute an arbitrary lens value, `core_graph_clusters` to extract clusters with arbitrary lens functions under core-distance connectivity.

Feel free to let me know if things need to change to better fit the project. I'll add review comments to explain the most important code changes below.