Skip to content

Conversation

@GrantPSpencer
Copy link
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:
    ConcurrentModification errors can occur during waged rebalance due to async thread modifying the best possible while the main thread is iterating over it

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    This PR refactors how we access and store the best possible and baseline assignments.

Get - returns a deep copy, assignment can now be safely modified without affecting the cache
Set - creates a deep copy prior to storing - the passed in assignment can be safely modified without affecting the cache

The above changes are aimed at preventing unintended modification of the underlying ResourceAssignments

When a new assignment is persisted, we no longer clear the previous map and add the new map's value, but just point
_globalBaseline or _bestPossibleAssignment at the new map. This change is intended to isolate threads so they cannot modify the underlying structure that another thread is pointing at.

Stack trace of concurrentModification exception that can occur:

2025/03/24 20:00:06.938 ERROR [BestPossibleStateCalcStage] [HelixController-pipeline-default-CLUSTER_NAME-(39f24f68_DEFAULT)] [helix] [] Event 39f24f68_DEFAULT : Failed to calculate the new Ideal States using the rebalancer WagedRebalancer due to INVALID_REBALANCER_STATUS
org.apache.helix.HelixRebalanceException: Failed to get the current best possible assignment because of unexpected error. Failure Type: INVALID_REBALANCER_STATUS
        at org.apache.helix.controller.rebalancer.waged.AssignmentManager.getBestPossibleAssignment(AssignmentManager.java:98) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:456) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]         at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:905) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1556) [org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
Caused by: java.util.ConcurrentModificationException
        at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1855) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
        at org.apache.helix.controller.rebalancer.waged.AssignmentManager.getBestPossibleAssignment(AssignmentManager.java:92) ~[org.apache.helix.helix-core-1.4.3-dev-202502211050.jar:1.4.3-dev-202502211050]
        ... 10 more

Tests

  • The following tests are written for this issue:
    No new tests were added

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestWagedRebalancer -pl=helix-core

[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.02 s - in org.apache.helix.controller.rebalancer.waged.TestWagedRebalancer
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  19.526 s
[INFO] Finished at: 2025-04-15T12:26:57-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
    N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Comment on lines 82 to 84
return _globalBaseline.entrySet().stream().collect(
Collectors.toMap(Map.Entry::getKey,
entry -> new ResourceAssignment(entry.getValue().getRecord())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try not implementing lambda function in core piece. It hurts the performance.

@GrantPSpencer GrantPSpencer force-pushed the assignment-metadata-thread-safe branch from 3960621 to a4ae8ae Compare May 16, 2025 15:12
Comment on lines +112 to +115
Map<String, ResourceAssignment> result = new HashMap<>(_bestPossibleAssignment);
for (Map.Entry<String, ResourceAssignment> entry : _bestPossibleAssignment.entrySet()) {
result.put(entry.getKey(), new ResourceAssignment(entry.getValue().getRecord()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bestPossibleAssignment.size()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this logic I have seen for multiple lines. Please make them in a function to be called in the place similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants