
add branch detection functionality #32

Merged
merged 11 commits from dev/branches into TutteInstitute:main on Jan 7, 2025

Conversation

JelmerBot
Collaborator

I found some time to implement the branch detection functionality for this repository as well. This PR adds two new files: core_graph.py, which implements re-usable functions that compute hdbscan-like clusters from a kNN–MST union graph whose edges are weighted by given per-point lens values, and branches.py, which uses those functions to compute branches with a per-cluster eccentricity lens.

Like #667 on the Cython implementation, this PR adds a persistence threshold to the hdbscan interface and supports overriding the cluster labels in the BranchDetector. In addition, I simplified a couple of functions without changing their behaviour. Finally, I added a compute_minimum_spanning_tree entry point, so future external projects can use the sequence compute_minimum_spanning_tree → compute an arbitrary lens value → core_graph_clusters to extract clusters of arbitrary lens functions under core-distance connectivity (see the sketch below).

Feel free to let me know if things need to change to better fit the project. I'll add review comments to explain the most important code changes below.
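For reference, a minimal sketch of that sequence. The entry-point names follow this description, but the import paths, signatures, and return values here are assumptions rather than the merged API:

import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical usage sketch -- names follow the PR description; the exact
# signatures and import paths in the merged code may differ.
from fast_hdbscan import compute_minimum_spanning_tree
from fast_hdbscan.core_graph import core_graph_clusters

data, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# 1. Build the minimum spanning tree under core-distance connectivity.
minimum_spanning_tree, neighbors, core_distances = compute_minimum_spanning_tree(data)

# 2. Compute an arbitrary per-point lens value (here: distance to the data mean).
lens_values = np.linalg.norm(data - data.mean(axis=0), axis=1)

# 3. Extract clusters of the lens function from the kNN--MST union graph.
clusters, *rest = core_graph_clusters(
    lens_values, minimum_spanning_tree, neighbors, core_distances
)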

Comment on lines -7 to +22
- @numba.njit(locals={"i": numba.types.int64})
- def merge_components(disjoint_set, candidate_neighbors, candidate_neighbor_distances, point_components):
-     component_edges = {np.int64(0): (np.int64(0), np.int64(1), np.float32(0.0)) for i in range(0)}
+ @numba.njit(locals={"parent": numba.types.int32})
+ def select_components(candidate_distances, candidate_neighbors, point_components):
+     component_edges = {np.int64(0): (np.int32(0), np.int32(1), np.float32(0.0)) for i in range(0)}

      # Find the best edges from each component
-     for i in range(candidate_neighbors.shape[0]):
-         from_component = np.int64(point_components[i])
+     for parent, (distance, neighbor, from_component) in enumerate(
+         zip(candidate_distances, candidate_neighbors, point_components)
+     ):
          if from_component in component_edges:
-             if candidate_neighbor_distances[i] < component_edges[from_component][2]:
-                 component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i])
+             if distance < component_edges[from_component][2]:
+                 component_edges[from_component] = (parent, neighbor, distance)
          else:
-             component_edges[from_component] = (np.int64(i), np.int64(candidate_neighbors[i]), candidate_neighbor_distances[i])
+             component_edges[from_component] = (parent, neighbor, distance)

      return component_edges
Collaborator Author

I split merge_components into select_components and merge_components so I can re-use the latter part to compute the minimum spanning tree of a kNN–MST union graph. That process needs to reject invalid neighbour values within the select_components part.
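To make the division of labour concrete, here is a toy version of the two phases over a plain union-find. These are simplified stand-ins for the library's numba-typed disjoint-set helpers, not the actual implementation:

def find(parent, i):
    # Plain-Python stand-in for ds_find.
    while parent[i] != i:
        i = parent[i]
    return i

def select_components(candidate_distances, candidate_neighbors, point_components):
    # Phase 1: keep only the shortest outgoing candidate edge per component.
    # (The real select_components would also reject invalid neighbour entries here.)
    component_edges = {}
    for point, (distance, neighbor, component) in enumerate(
        zip(candidate_distances, candidate_neighbors, point_components)
    ):
        best = component_edges.get(component)
        if best is None or distance < best[2]:
            component_edges[component] = (point, neighbor, distance)
    return component_edges

def merge_components(parent, component_edges):
    # Phase 2: union the endpoints of each selected edge, emitting MST edges.
    edges = []
    for source, target, distance in component_edges.values():
        root_a, root_b = find(parent, source), find(parent, target)
        if root_a != root_b:
            parent[root_b] = root_a
            edges.append((source, target, distance))
    return edges

parent = [0, 0, 2, 2]              # two components: {0, 1} and {2, 3}
point_components = [0, 0, 2, 2]
candidate_distances = [0.5, 0.2, 0.7, 0.4]
candidate_neighbors = [2, 3, 0, 1]
print(merge_components(parent, select_components(
    candidate_distances, candidate_neighbors, point_components)))
# [(1, 3, 0.2)] -- a single edge joins the two components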

Comment on lines -37 to +50
- def update_component_vectors(tree, disjoint_set, node_components, point_components):
+ def update_point_components(disjoint_set, point_components):
      for i in numba.prange(point_components.shape[0]):
          point_components[i] = ds_find(disjoint_set, np.int32(i))
+
+
+ @numba.njit()
+ def update_node_components(tree, node_components, point_components):
Collaborator Author

Similarly, I split update_component_vectors into update_point_components and update_node_components so I can re-use the former part.


+ edges = np.vstack(edges)
Collaborator Author

Refactor to call vstack only once, avoiding multiple allocations.
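The pattern in miniature (illustrative only, not the library code):

import numpy as np

# Collect the per-round edge arrays in a list and concatenate once at the
# end, instead of calling np.vstack on every round (one allocation per call).
per_round = [np.random.rand(3, 3), np.random.rand(2, 3), np.random.rand(4, 3)]

edges = []
for new_edges in per_round:
    edges.append(new_edges)
edges = np.vstack(edges)  # single allocation at the end
print(edges.shape)  # (9, 3)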

Comment on lines 286 to 305
- edges = initialize_boruvka_from_knn(neighbors, distances, core_distances, components_disjoint_set)
- update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
- while n_components > 1:
+ edges = [np.empty((0, 3), dtype=np.float64) for _ in range(0)]
+ new_edges = initialize_boruvka_from_knn(neighbors, core_distances, components_disjoint_set)
+ while True:
+     edges.append(new_edges)
+     update_point_components(components_disjoint_set, point_components)
+     update_node_components(tree, node_components, point_components)
+     if np.unique(point_components).shape[0] == 1:
+         break
+
      candidate_distances, candidate_indices = boruvka_tree_query(tree, node_components, point_components,
                                                                  core_distances)
-     new_edges = merge_components(components_disjoint_set, candidate_indices, candidate_distances, point_components)
-     update_component_vectors(tree, components_disjoint_set, node_components, point_components)
-
-     edges = np.vstack((edges, new_edges))
-     n_components = np.unique(point_components).shape[0]
+     component_edges = select_components(candidate_distances, candidate_indices, point_components)
+     new_edges = merge_components(components_disjoint_set, component_edges)
Collaborator Author

Refactored to avoid duplicating the query/merge lines before and inside the loop.

Comment on lines +164 to +167
def empty_condensed_tree():
parents = np.empty(shape=0, dtype=np.intp)
others = np.empty(shape=0, dtype=np.float32)
return CondensedTree(parents, parents, others, others)
Collaborator Author

Helper function used to signal that an overridden cluster contains multiple connected components.

Comment on lines 253 to 264
- def extract_leaves(condensed_tree, allow_single_cluster=True):
-     n_nodes = condensed_tree.parent.max() + 1
-     n_points = condensed_tree.parent.min()
-     leaf_indicator = np.ones(n_nodes, dtype=np.bool_)
-     leaf_indicator[:n_points] = False
-
-     for parent, child_size in zip(condensed_tree.parent, condensed_tree.child_size):
-         if child_size > 1:
-             leaf_indicator[parent] = False
-
-     return np.nonzero(leaf_indicator)[0]
+ def extract_leaves(cluster_tree, n_points):
+     n_nodes = cluster_tree.child.max() + 1
+     leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)
+     leaf_indicator[cluster_tree.parent - n_points] = False
+     return np.nonzero(leaf_indicator)[0] + n_points
Collaborator Author

Simplifies the function and saves some memory.
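A toy run of the new version, for intuition (hypothetical node numbering; the leaves are exactly the cluster nodes that never appear in the parent column):

import numpy as np

# Toy cluster tree: points 0..4, clusters 5..9; node 5 is the root with
# children 6 and 7, and node 6 has children 8 and 9.
n_points = 5
parent = np.array([5, 5, 6, 6])
child = np.array([6, 7, 8, 9])

n_nodes = child.max() + 1
leaf_indicator = np.ones(n_nodes - n_points, dtype=np.bool_)
leaf_indicator[parent - n_points] = False
print(np.nonzero(leaf_indicator)[0] + n_points)  # [7 8 9]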

Contributor

It would be helpful to keep the same API on this if possible, as I use it in other projects.

Collaborator Author

OK, I added the cluster-tree version alongside the condensed-tree version.

Comment on lines -540 to +626
- eps = 1 / cluster_tree.lambda_val[cluster_tree.child == cluster][0]
- if eps < min_persistence:
+ idx = np.searchsorted(cluster_tree.child, cluster)
+ death_eps = 1 / cluster_tree.lambda_val[idx]
+ if death_eps < min_epsilon:
      if cluster not in processed:
-         parent = traverse_upwards(cluster_tree, min_persistence, root, cluster)
+         parent = traverse_upwards(cluster_tree, min_epsilon, root, cluster)
          selected.append(parent)
-         processed |= segments_in_branch(cluster_tree, parent)
+         processed |= segments_in_branch(parents, children, parent)
Collaborator Author

Rename persistence to epsilon to avoid confusion.

Comment on lines +645 to +664
+ def segments_in_branch(parents, children, segment):
      # only way to create a typed empty set
-     result = {np.intp(0)}
+     child_set = {np.int64(0)}
+     result = {np.int64(0)}
      result.clear()
      to_process = {segment}

      while len(to_process) > 0:
          result |= to_process
-         to_process = set(cluster_tree.child[
-             in_set_parallel(cluster_tree.parent, to_process)
-         ])
+         child_set.clear()
+         for segment in to_process:
+             idx = np.searchsorted(parents, segment)
+             if idx >= len(parents):
+                 continue
+             child_set.add(children[idx])
+             child_set.add(children[idx + 1])
+
+         to_process.clear()
+         to_process |= child_set
Collaborator Author

Sequential binary search rather than parallel linear searches.
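The idea in isolation (this assumes the parent column is sorted and each parent stores its two children on consecutive rows, which is how the cluster tree is laid out):

import numpy as np

# Binary search for a segment's two children in a sorted parent column,
# where the two children of each parent sit on consecutive rows.
parents = np.array([5, 5, 6, 6, 7, 7])
children = np.array([6, 7, 8, 9, 10, 11])

idx = np.searchsorted(parents, 6)
print(children[idx], children[idx + 1])  # 8 9 -- the two children of segment 6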

Comment on lines -176 to +185
- sklearn_tree = KDTree(data)
- numba_tree = kdtree_to_numba(sklearn_tree)
- edges = parallel_boruvka(
-     numba_tree,
+ minimum_spanning_tree, neighbors, core_distances = compute_minimum_spanning_tree(
+     data,
Collaborator Author
The reason will be displayed to describe this comment to others. Learn more.

Moves the MST construction code into its own function.

Comment on lines -84 to +83
- labels, p, ltree, ctree, mtree = fast_hdbscan(X, return_trees=True)
+ labels, p, ltree, ctree, mtree, neighbors, cdists = fast_hdbscan(X, return_trees=True)
Collaborator Author

I now also return the neighbours and core distances from fast_hdbscan, but only when return_trees=True. Would that be considered a breaking API change? There was only a single call site in the tests that needed to be updated.

Contributor

No, I don't think that's too serious, as long as the tests are updated and the documentation is updated to explain how it works now.

@JelmerBot force-pushed the dev/branches branch 2 times, most recently from 489084b to ca7b56c on December 24, 2024 at 15:07
@JelmerBot marked this pull request as draft on December 27, 2024 at 08:05
@JelmerBot force-pushed the dev/branches branch 2 times, most recently from 388590b to 33628aa on December 28, 2024 at 15:06
@lmcinnes
Contributor

Let me know when this is ready for a full review.

@@ -486,7 +502,7 @@ def unselect_below_node(node, cluster_tree, selected_clusters):

  @numba.njit(fastmath=True)
  def eom_recursion(node, cluster_tree, node_scores, node_sizes, selected_clusters, max_cluster_size):
-     current_score = node_scores[node]
+     current_score = max(node_scores[node], 0.0)  # floating point errors can make score negative!
Collaborator Author

Fixes a dictionary lookup error when assigning labels in cases where EOM selects only one child of the root because the other child has a negative score due to floating point errors.
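A quick illustration of the failure mode (pure floating point arithmetic, nothing library-specific):

# Stability scores are sums and differences of lambda values, so cancellation
# can leave a tiny negative residue where the exact score would be zero.
score = 0.3 - 0.1 - 0.2
print(score)            # -2.7755575615628914e-17
print(max(score, 0.0))  # 0.0 -- the clamp now applied in eom_recursion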

Comment on lines -315 to +343
- clean_data_labels = y
-
- if self.semi_supervised:
-     clean_data_labels = y[finite_index]
+ clean_data_labels = y[finite_index] if self.semi_supervised else None
+ sample_weight = sample_weight[finite_index] if sample_weight is not None else None
Collaborator Author

sample_weight was not filtered with finite_index before.

@JelmerBot
Collaborator Author

This should also be ready now. I added a "for developers" section to the docs that tries to explain how the branch detection functionality is built, but I am not sure about the text right now. Feel free to improve it or exclude it from the docs if it does not belong there.

@JelmerBot marked this pull request as ready for review on January 3, 2025 at 12:10
@lmcinnes merged commit 2f83774 into TutteInstitute:main on Jan 7, 2025
13 checks passed
@lmcinnes
Contributor

lmcinnes commented Jan 7, 2025

Thanks for all the work. I do especially appreciate the effort to make the internals re-usable (which was one of the intents of this project -- something a little cleaner and more decomposable to work with). Hopefully you can make use of this code accordingly. And the "for developers" documentation looks good; I think it outlines the right entry points for re-use.
