
add branch detection functionality #32

Merged
merged 11 commits from dev/branches into TutteInstitute:main on Jan 7, 2025

Conversation

JelmerBot
Collaborator

I found some time to implement the branch detection functionality for this repository as well. This PR adds two new files: core_graph.py, which implements re-usable functions that compute hdbscan-like clusters from a kNN–MST union graph whose edges are weighted by given per-point lens values, and branches.py, which uses those functions to compute branches with a per-cluster eccentricity lens.

Like #667 on the Cython implementation, this PR adds a persistence threshold to the hdbscan interface and supports overriding the cluster labels in the BranchDetector. In addition, I simplified a couple of functions without changing their behaviour. Finally, I added a compute_minimum_spanning_tree entry point, so future external projects can use the sequence compute_minimum_spanning_tree → compute an arbitrary lens value → core_graph_clusters to extract clusters of arbitrary lens functions under core-distance connectivity (see the sketch below).

Feel free to let me know if things need to change to better fit the project. I'll add review comments to explain the most important code changes below.
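For reference, a minimal sketch of that sequence. The entry-point names follow this description, but the import paths, signatures, and return values here are assumptions rather than the merged API:

import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical usage sketch -- names follow the PR description; the exact
# signatures and import paths in the merged code may differ.
from fast_hdbscan import compute_minimum_spanning_tree
from fast_hdbscan.core_graph import core_graph_clusters

data, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# 1. Build the minimum spanning tree under core-distance connectivity.
minimum_spanning_tree, neighbors, core_distances = compute_minimum_spanning_tree(data)

# 2. Compute an arbitrary per-point lens value (here: distance to the data mean).
lens_values = np.linalg.norm(data - data.mean(axis=0), axis=1)

# 3. Extract clusters of the lens function from the kNN--MST union graph.
clusters, *rest = core_graph_clusters(
    lens_values, minimum_spanning_tree, neighbors, core_distances
)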

Comment on lines -7 to +22
- @numba.njit(locals={"i": numba.types.int64})
- def merge_components(disjoint_set, candidate_neighbors, candidate_neighbor_distances, point_components):
-     component_edges = {np.int64(0): (np.int64(0), np.int64(1), np.float32(0.0)) for i in range(0)}
+ @numba.njit(locals={"parent": numba.types.int32})
+ def select_components(candidate_distances, candidate_neighbors, point_components):
+     component_edges = {np.int64(0): (np.int32(0), np.int32(1), np.float32(0.0)) for i in range(0)}

      # Find the best edges from each component
-     for i in range(candidate_neighbors.shape[0]):
-         from_component = np.int64(point_components[i])
+     for parent, (distance, neighbor, from_component) in enumerate(
+         zip(candidate_distances, candidate_neighbors, point_components)
+     ):
          if from_component in component_edges:
-             if candidate_neighbor_distances[i] < component_edges[from_component][2]:
-                 component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i])
+             if distance < component_edges[from_component][2]:
+                 component_edges[from_component] = (parent, neighbor, distance)
          else:
-             component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i])
+             component_edges[from_component] = (parent, neighbor, distance)

      return component_edges
Collaborator Author

I split merge_components into select_components and merge_components so I can re-use the latter part to compute the minimum spanning tree of a kNN–MST union graph. That process needs to reject invalid neighbour values within the select_components part.
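To make the division of labour concrete, here is a toy version of the two phases over a plain union-find. These are simplified stand-ins for the library's numba-typed disjoint-set helpers, not the actual implementation:

def find(parent, i):
    # Plain-Python stand-in for ds_find.
    while parent[i] != i:
        i = parent[i]
    return i

def select_components(candidate_distances, candidate_neighbors, point_components):
    # Phase 1: keep only the shortest outgoing candidate edge per component.
    # (The real select_components would also reject invalid neighbour entries here.)
    component_edges = {}
    for point, (distance, neighbor, component) in enumerate(
        zip(candidate_distances, candidate_neighbors, point_components)
    ):
        best = component_edges.get(component)
        if best is None or distance < best[2]:
            component_edges[component] = (point, neighbor, distance)
    return component_edges

def merge_components(parent, component_edges):
    # Phase 2: union the endpoints of each selected edge, emitting MST edges.
    edges = []
    for source, target, distance in component_edges.values():
        root_a, root_b = find(parent, source), find(parent, target)
        if root_a != root_b:
            parent[root_b] = root_a
            edges.append((source, target, distance))
    return edges

parent = [0, 0, 2, 2]              # two components: {0, 1} and {2, 3}
point_components = [0, 0, 2, 2]
candidate_distances = [0.5, 0.2, 0.7, 0.4]
candidate_neighbors = [2, 3, 0, 1]
print(merge_components(parent, select_components(
    candidate_distances, candidate_neighbors, point_components)))
# [(1, 3, 0.2)] -- a single edge joins the two components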

Comment on lines -37 to +50
- def update_component_vectors(tree, disjoint_set, node_components, point_components):
+ def update_point_components(disjoint_set, point_components):
      for i in numba.prange(point_components.shape[0]):
          point_components[i] = ds_find(disjoint_set, np.int32(i))
+
+
+ @numba.njit()
+ def update_node_components(tree, node_components, point_components):
Collaborator Author

Similarly, I split update_component_vectors into update_point_components and update_node_components so I can re-use the former part.


+ edges = np.vstack(edges)
Collaborator Author

Refactor to call vstack only once, avoiding multiple allocations.
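The pattern in miniature (illustrative only, not the library code):

import numpy as np

# Collect the per-round edge arrays in a list and concatenate once at the
# end, instead of calling np.vstack on every round (one allocation per call).
per_round = [np.random.rand(3, 3), np.random.rand(2, 3), np.random.rand(4, 3)]

edges = []
for new_edges in per_round:
    edges.append(new_edges)
edges = np.vstack(edges)  # single allocation at the end
print(edges.shape)  # (9, 3)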

Comment on lines 286 to 305
- edges = initialize_boruvka_from_knn(neighbors, distances, core_distances, components_disjoint_set)
- update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
- while n_components > 1:
+ edges = [np.empty((0, 3), dtype=np.float64) for _ in range(0)]
+ new_edges = initialize_boruvka_from_knn(neighbors, core_distances, components_disjoint_set)
+ while True:
+     edges.append(new_edges)
+     update_point_components(components_disjoint_set, point_components)
+     update_node_components(tree, node_components, point_components)
+     if np.unique(point_components).shape[0] == 1:
+         break
+
      candidate_distances, candidate_indices = boruvka_tree_query(tree, node_components, point_components,
                                                                  core_distances)
-     new_edges = merge_components(components_disjoint_set, candidate_indices, candidate_distances, point_components)
-     update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
-     edges = np.vstack((edges, new_edges))
-     n_components = np.unique(point_components).shape[0]
+     component_edges = select_components(candidate_distances, candidate_indices, point_components)
+     new_edges = merge_components(components_disjoint_set, component_edges)
Collaborator Author

Refactored to avoid duplicating the query/merge lines before and inside the loop.

Comment on lines +164 to +167
def empty_condensed_tree():
parents = np.empty(shape=0, dtype=np.intp)
others = np.empty(shape=0, dtype=np.float32)
return CondensedTree(parents, parents, others, others)
Collaborator Author

Helper function used to signal that an overridden cluster contains multiple connected components.

Comment on lines 253 to 264
- def extract_leaves(condensed_tree, allow_single_cluster=True):
-     n_nodes = condensed_tree.parent.max() + 1
-     n_points = condensed_tree.parent.min()
-     leaf_indicator = np.ones(n_nodes, dtype=np.bool_)
-     leaf_indicator[:n_points] = False
-
-     for parent, child_size in zip(condensed_tree.parent, condensed_tree.child_size):
-         if child_size > 1:
-             leaf_indicator[parent] = False
-
-     return np.nonzero(leaf_indicator)[0]
+ def extract_leaves(cluster_tree, n_points):
+     n_nodes = cluster_tree.child.max() + 1
+     leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)
+     leaf_indicator[cluster_tree.parent - n_points] = False
+     return np.nonzero(leaf_indicator)[0] + n_points
Collaborator Author

Simplifies the function and saves some memory.
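A toy run of the new version, for intuition (hypothetical node numbering; the leaves are exactly the cluster nodes that never appear in the parent column):

import numpy as np

# Toy cluster tree: points 0..4, clusters 5..9; node 5 is the root with
# children 6 and 7, and node 6 has children 8 and 9.
n_points = 5
parent = np.array([5, 5, 6, 6])
child = np.array([6, 7, 8, 9])

n_nodes = child.max() + 1
leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)
leaf_indicator[parent - n_points] = False
print(np.nonzero(leaf_indicator)[0] + n_points)  # [7 8 9]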

Contributor

It would be helpful to keep the same API on this if possible, as I use it in other projects.

Collaborator Author

OK, I added the cluster-tree version alongside the condensed-tree version.

Comment on lines -540 to +626
- eps = 1 / cluster_tree.lambda_val[cluster_tree.child == cluster][0]
- if eps < min_persistence:
+ idx = np.searchsorted(cluster_tree.child, cluster)
+ death_eps = 1 / cluster_tree.lambda_val[idx]
+ if death_eps < min_epsilon:
      if cluster not in processed:
-         parent = traverse_upwards(cluster_tree, min_persistence, root, cluster)
+         parent = traverse_upwards(cluster_tree, min_epsilon, root, cluster)
          selected.append(parent)
-         processed |= segments_in_branch(cluster_tree, parent)
+         processed |= segments_in_branch(parents, children, parent)
Collaborator Author

Rename persistence to epsilon to avoid confusion.

Comment on lines +645 to +664
+ def segments_in_branch(parents, children, segment):
      # only way to create a typed empty set
-     result = {np.intp(0)}
+     child_set = {np.int64(0)}
+     result = {np.int64(0)}
      result.clear()
      to_process = {segment}

      while len(to_process) > 0:
          result |= to_process
-         to_process = set(cluster_tree.child[
-             in_set_parallel(cluster_tree.parent, to_process)
-         ])
+         child_set.clear()
+         for segment in to_process:
+             idx = np.searchsorted(parents, segment)
+             if idx >= len(parents):
+                 continue
+             child_set.add(children[idx])
+             child_set.add(children[idx + 1])
+
+         to_process.clear()
+         to_process |= child_set
Collaborator Author

Sequential binary search rather than parallel linear searches.
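The idea in isolation (this assumes the parent column is sorted and each parent stores its two children on consecutive rows, which is how the cluster tree is laid out):

import numpy as np

# Binary search for a segment's two children in a sorted parent column,
# where the two children of each parent sit on consecutive rows.
parents = np.array([5, 5, 6, 6, 7, 7])
children = np.array([6, 7, 8, 9, 10, 11])

idx = np.searchsorted(parents, 6)
print(children[idx], children[idx + 1])  # 8 9 -- the two children of segment 6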

Comment on lines -176 to +185
- sklearn_tree = KDTree(data)
- numba_tree = kdtree_to_numba(sklearn_tree)
- edges = parallel_boruvka(
-     numba_tree,
+ minimum_spanning_tree, neighbors, core_distances = compute_minimum_spanning_tree(
+     data,
Collaborator Author
The reason will be displayed to describe this comment to others. Learn more.

Moves the MST construction code into its own function.

Comment on lines -84 to +83
- labels, p, ltree, ctree, mtree = fast_hdbscan(X, return_trees=True)
+ labels, p, ltree, ctree, mtree, neighbors, cdists = fast_hdbscan(X, return_trees=True)
Collaborator Author

I now also return the neighbours and core distances from fast_hdbscan, but only when return_trees=True. Would that be considered a breaking API change? There was only a single call site in the tests that needed to be updated.

Contributor

No, I don't think that's too serious, as long as the tests are updated and the documentation is updated to explain how it works now.

@JelmerBot force-pushed the dev/branches branch 2 times, most recently from 489084b to ca7b56c on December 24, 2024 at 15:07
@JelmerBot marked this pull request as draft on December 27, 2024 at 08:05
@JelmerBot force-pushed the dev/branches branch 2 times, most recently from 388590b to 33628aa on December 28, 2024 at 15:06
@lmcinnes
Contributor

Let me know when this is ready for a full review.

@@ -486,7 +502,7 @@ def unselect_below_node(node, cluster_tree, selected_clusters):

  @numba.njit(fastmath=True)
  def eom_recursion(node, cluster_tree, node_scores, node_sizes, selected_clusters, max_cluster_size):
-     current_score = node_scores[node]
+     current_score = max(node_scores[node], 0.0)  # floating point errors can make score negative!
Collaborator Author

Fixes a dictionary lookup error when assigning labels in cases where EOM selects only one child of the root because the other child has a negative score due to floating point errors.
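A quick illustration of the failure mode (pure floating point arithmetic, nothing library-specific):

# Stability scores are sums and differences of lambda values, so cancellation
# can leave a tiny negative residue where the exact score would be zero.
score = 0.3 - 0.1 - 0.2
print(score)            # -2.7755575615628914e-17
print(max(score, 0.0))  # 0.0 -- the clamp now applied in eom_recursion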

Comment on lines -315 to +343
- clean_data_labels = y
-
- if self.semi_supervised:
-     clean_data_labels = y[finite_index]
+ clean_data_labels = y[finite_index] if self.semi_supervised else None
+ sample_weight = sample_weight[finite_index] if sample_weight is not None else None
Collaborator Author

sample_weight was not filtered with finite_index before.

@JelmerBot
Collaborator Author

This should also be ready now. I added a "for developers" section to the docs that tries to explain how the branch detection functionality is built, but I am not sure about the text right now. Feel free to improve it or exclude it from the docs if it does not belong there.

@JelmerBot marked this pull request as ready for review on January 3, 2025 at 12:10
@lmcinnes merged commit 2f83774 into TutteInstitute:main on Jan 7, 2025
13 checks passed
@lmcinnes
Contributor

lmcinnes commented Jan 7, 2025

Thanks for all the work. I do especially appreciate the effort to make the internals re-usable (which was one of the intents of this project -- something a little cleaner and more decomposable to work with). Hopefully you can make use of this code accordingly. And the "for developers" documentation looks good; I think it outlines the right entry points for re-use.
