Skip to content

Refactor beeswarm #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 12, 2025
Merged

Refactor beeswarm #254

merged 14 commits into from
Jun 12, 2025

Conversation

cvanelteren
Copy link
Contributor

This PR is a major rework of the beeswarm functionality that was recently merged to main. While testing the newly added implementation with examples from the SHAP library, I discovered that our current approach doesn't integrate well with their plotting workflows and becomes computationally expensive for larger datasets. The former implementation processed each point individually to reduce overlap, which scales poorly as dataset size increases.

This rework improves performance by vectorizing the overlap reduction function, replacing the previous O(n²) per-point approach with vectorized operations that scale better with large datasets. Our implementation remains intentionally barebones compared to the SHAP library's more sophisticated feature clustering capabilities. For users requiring advanced clustering features, we recommend the specialized SHAP library, which can achieve similar visual results with additional functionality.

The docstring is also corrected to be in line with the signature of the function. Lastly, the tests are updated to reflect the changes in the api and rely more on internal functions.

image

snippet
import ultraplot as uplt, numpy as np
import xgboost, shap

# # # train XGBoost model
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier().fit(X, y)

# # # compute SHAP values
explainer = shap.Explainer(model, X)
shap_values = explainer(X)
# %%
print(shap_values.data.shape)
c = shap_values.data
c = (c - c.min(0)[None]) / (c.max(0) - c.min(0))[None]

fig, ax = uplt.subplots()
ax.beeswarm(shap_values.values, feature_values=c, cmap="burd", ss=1)
ax.format(
    xlabel="SHAP values",
    yticks=np.arange(c.shape[1]),
    yticklabels=shap_values.feature_names,
)

uplt.show(block=1)

@cvanelteren cvanelteren requested a review from Copilot June 11, 2025 01:03
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the beeswarm functionality to improve performance and scalability by replacing the per-point collision detection with a vectorized, histogram-based overlap reduction technique. Key changes include a revised API for beeswarm plotting, updated documentation reflecting the new parameters, and modified tests to validate the new implementation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
ultraplot/tests/test_1dplots.py Updated test inputs to match the new vectorized beeswarm API
ultraplot/axes/plot.py Refactored beeswarm implementation and docstring with new parameters, including vectorized binning and parameter renaming
Comments suppressed due to low confidence (1)

ultraplot/axes/plot.py:3510

  • The marker size parameter is named 'ss' in _apply_beeswarm, yet the tests provide a 'sizes' argument. Please standardize the parameter naming and update the documentation to ensure consistency across the API.
ss: float | np.ndarray = None,

@cvanelteren
Copy link
Contributor Author

For reference this is the plot generated from the same dataset in the shap library:

image

@cvanelteren cvanelteren requested a review from beckermr June 11, 2025 01:08
@cvanelteren cvanelteren added this to the v1.52 milestone Jun 11, 2025
@cvanelteren cvanelteren self-assigned this Jun 11, 2025
@cvanelteren cvanelteren added the enhancement New feature or request label Jun 11, 2025
@cvanelteren
Copy link
Contributor Author

cvanelteren commented Jun 11, 2025

Note to self I may want to change the max lineage to be slightly less than half the level width.

Copy link
Collaborator

@beckermr beckermr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks ok but the unit tests are failing.

@cvanelteren
Copy link
Contributor Author

Unittests are a bit different visually but act similarly.

@cvanelteren
Copy link
Contributor Author

That is, I was not expecting them to behave the same as on main I computed the colours based on the height, which is now handled implicitly through scatter functionality.

@beckermr
Copy link
Collaborator

No there is an actual error

FAILED ultraplot/tests/test_1dplots.py::test_beeswarm - AttributeError: PathCollection.set() got an unexpected keyword argument 'size'

@cvanelteren
Copy link
Contributor Author

Ah ok will fix.

Copy link

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 96.15385% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
ultraplot/axes/plot.py 95.12% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@cvanelteren
Copy link
Contributor Author

Should be good now. When merged I will draft a release.

@cvanelteren cvanelteren requested a review from beckermr June 12, 2025 09:17
Copy link
Collaborator

@beckermr beckermr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is definitely not working with the new code. See the plot difference:

Screenshot 2025-06-12 at 9 31 39 AM

@cvanelteren
Copy link
Contributor Author

I changed the test. The colors were first inferred and now delegated to scatter so it follow the prop cycle.

@cvanelteren
Copy link
Contributor Author

Will need to change the bottom left but the rest looks fine to me.

@beckermr
Copy link
Collaborator

Did you check that the new and old code give consistent results? Changing the code and the test means that we cannot tell.

@cvanelteren
Copy link
Contributor Author

I can check it; the reason why i changed is to ensure that the prop cycle is used as it's more consistent with the rest of the api. I can temp change it to confirm.

@cvanelteren
Copy link
Contributor Author

image
The test is visualized next to the mocked top left plot using this PR. The original has slightly more jitter since we were modifying the x and y value randomly from the current position, here we just modify the x or y value (depending on the orientation, for this example just x).

I will adjust the test to mimic the original and call it a day.

@cvanelteren cvanelteren enabled auto-merge (squash) June 12, 2025 18:09
@beckermr beckermr disabled auto-merge June 12, 2025 19:34
@beckermr beckermr merged commit 44df33a into Ultraplot:main Jun 12, 2025
10 of 14 checks passed
@cvanelteren
Copy link
Contributor Author

Drafting a release now introducing the new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants