ZacAttack

After some experimentation, the main culprit for the performance degradation turned out to be the lag probe being too aggressive. The previous default lag probe interval of 250ms caused as much as a 20% degradation in performance when combined with enabling io_context metrics. Setting the default to above 60s seems to mitigate the issue. To come to this conclusion we tested the following configurations:

Trial 1: ~400 actors/s <-- way too slow
-RAY_emit_main_service_metrics = 1

Trial 2: ~500+ actors/s <-- where we want to be
-RAY_emit_main_service_metrics = -1

Trial 3: ~500+ actors/s
-RAY_emit_main_service_metrics = 1
-RAY_io_context_event_loop_lag_collection_interval_ms = -1 <-- disabled

Trial 4: ~500+ actors/s <-- bingo!
-RAY_emit_main_service_metrics = 1
-RAY_io_context_event_loop_lag_collection_interval_ms = 6000
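
As a quick sanity check on the numbers above, the following sketch works out the relative throughput drop between the slow and fast trials and how often a single probe fires per minute at each interval. The throughput figures are approximate ("~400", "~500+"), so 500 is assumed here as a lower bound for the fast trials; the helper names are made up for illustration and are not part of Ray.

```python
# Approximate throughputs reported in the trials, in actors/s.
# "~500+" is treated as 500 (assumption: lower bound of the fast trials).
BASELINE_ACTORS_PER_S = 500  # Trials 2-4
DEGRADED_ACTORS_PER_S = 400  # Trial 1


def relative_drop(baseline: float, degraded: float) -> float:
    """Fraction of throughput lost relative to the baseline."""
    return (baseline - degraded) / baseline


def probes_per_minute(interval_ms: float) -> float:
    """How many times one probe fires per minute at the given interval."""
    return 60_000 / interval_ms


if __name__ == "__main__":
    drop = relative_drop(BASELINE_ACTORS_PER_S, DEGRADED_ACTORS_PER_S)
    print(f"relative drop: {drop:.0%}")
    print(f"probes/min at 250 ms:  {probes_per_minute(250)}")
    print(f"probes/min at 6000 ms: {probes_per_minute(6000)}")
```

With these assumed figures the drop comes out at 20%, and moving from 250 ms to the 6000 ms used in Trial 4 cuts the firing rate of each probe from 240 to 10 per minute.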

The default value of 250ms, combined with the increased use of lag probes when the metrics are enabled, causes enough degradation to be noticeable. Increasing the interval sufficiently seems to be the way to avoid this and still have our metrics.
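
For readers unfamiliar with the idea, a lag probe can be sketched roughly as follows: post a trivial callback to the event loop and measure how long it waits before it actually runs. This is only an illustration using Python's asyncio, not Ray's C++ io_context implementation; the function name and the `busy_seconds` parameter are hypothetical.

```python
import asyncio
import time


async def probe_event_loop_lag(busy_seconds: float = 0.0) -> float:
    """Return how long a freshly posted callback waited before running."""
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    posted_at = time.monotonic()

    # Post a no-op "probe" to the loop; the delay until it runs is the lag.
    loop.call_soon(lambda: done.set_result(time.monotonic() - posted_at))

    # Simulate other work occupying the loop before the probe can run.
    # (Hypothetical knob, only here to make the lag visible.)
    if busy_seconds > 0:
        time.sleep(busy_seconds)

    return await done


if __name__ == "__main__":
    lag = asyncio.run(probe_event_loop_lag(busy_seconds=0.05))
    print(f"measured lag: {lag * 1000:.1f} ms")
```

Each probe is itself an extra task on the loop, which is why firing it more often, on more loops, can add measurable overhead.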

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: zac <[email protected]>
@ZacAttack ZacAttack requested a review from a team as a code owner October 9, 2025 23:00

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates default configurations to enable io_service metrics by default. It enables emit_main_service_metrics and increases io_context_event_loop_lag_collection_interval_ms to mitigate performance issues observed with the previous default. My review focuses on a potential discrepancy in the new interval value.

Signed-off-by: zac <[email protected]>
@ZacAttack ZacAttack added the labels "core" (Issues that should be addressed in Ray Core) and "go" (add ONLY when ready to merge, run all tests) Oct 9, 2025
Co-authored-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
@edoakes edoakes merged commit 4d75ad2 into ray-project:master Oct 10, 2025
6 checks passed