[core] Guarantee min amount of failures for request/response using rpc chaos… #57288

Sparks0219 · 2025-10-07T23:59:16Z

Why are these changes needed?

For cases where we use the wildcard and have a wide variety of RPCs, often the amount of times a couple RPCs will be called will be vastly greater than the others. Using the probability based config means its highly unlikely these less used RPCs will be tested. Added an additional 4/5 config to guarantee a lower bound of failures for EACH individual RPC on request and response.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

… config Signed-off-by: joshlee <[email protected]>

gemini-code-assist

Code Review

The pull request introduces a mechanism to guarantee a minimum number of request and response failures for RPC chaos testing. This is a valuable addition for ensuring test coverage of failure paths for less frequently used RPCs. The implementation correctly parses the new configuration and adds logic to enforce the guaranteed failures. However, there is a critical logic issue when a finite number of total failures is specified, which can cause the failure guarantee to be violated. Additionally, the new logic is not fully covered by tests. My review includes a critical comment on the logic bug and a suggestion to add a test case to cover this scenario.

gemini-code-assist · 2025-10-08T00:01:46Z

src/ray/rpc/rpc_chaos.cc

+  RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) {
    if (failable.num_remaining_failures == 0) {
      // If < 0, unlimited failures.
      return RpcFailure::None;
    }
+
+    if (num_req_failures_[name] < failable.num_lower_bound_req_failures) {
+      failable.num_remaining_failures--;
+      num_req_failures_[name]++;
+      return RpcFailure::Request;
+    }
+    if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) {
+      failable.num_remaining_failures--;
+      num_resp_failures_[name]++;
+      return RpcFailure::Response;
+    }


There is a critical logic bug in GetFailureTypeFromFailable that can break the guarantee of minimum failures when a positive num_remaining_failures is used.

The check if (failable.num_remaining_failures == 0) at the beginning of the function will cause an early exit if the total number of failures is exhausted before all guaranteed failures have been injected.

For example, with a configuration like *="1:100:0:2:0", we expect 2 guaranteed request failures:

The first call will inject a request failure, and num_remaining_failures will be decremented from 1 to 0.

The second call will find num_remaining_failures is 0 and return RpcFailure::None immediately, failing to inject the second guaranteed request failure.

To fix this, the check for num_remaining_failures should be performed right before each failure injection (both guaranteed and probabilistic), not once at the top of the function. The logic for decrementing num_remaining_failures should also only apply if it's positive.

A conceptual refactoring would look like this:

RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) { // First, handle guaranteed request failures. if (num_req_failures_[name] < failable.num_lower_bound_req_failures) { if (failable.num_remaining_failures == 0) { return RpcFailure::None; } if (failable.num_remaining_failures > 0) { failable.num_remaining_failures--; } num_req_failures_[name]++; return RpcFailure::Request; } // Then, handle guaranteed response failures. if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) { if (failable.num_remaining_failures == 0) { return RpcFailure::None; } if (failable.num_remaining_failures > 0) { failable.num_remaining_failures--; } num_resp_failures_[name]++; return RpcFailure::Response; } // Finally, handle probabilistic failures. if (failable.num_remaining_failures == 0) { return RpcFailure::None; } std::uniform_int_distribution<size_t> dist(1ul, 100ul); const size_t random_number = dist(gen_); if (random_number <= failable.req_failure_prob) { if (failable.num_remaining_failures > 0) { failable.num_remaining_failures--; } return RpcFailure::Request; } if (random_number <= failable.req_failure_prob + failable.resp_failure_prob) { if (failable.num_remaining_failures > 0) { failable.num_remaining_failures--; } return RpcFailure::Response; } return RpcFailure::None; }

Since this change affects lines outside of the current diff, I'm providing it here for your consideration. The key is to move the num_remaining_failures check and conditional decrement into each failure-injection path.

uhh i think it maybe has a point here 👁️ 👁️

Hmmm I'm not entirely sure about this, since if the guaranteed failures aren't part of the total failures then it's not really the total failures. But when you set the lower bound and start using multiple RPCs, then it's inaccurate anyway. I guess we could go with this, and rename the first parameter from total failures to like probabilistic failures? Hence first parameter will always exclude the parameters caused by the 4/5 parameter.

gemini-code-assist · 2025-10-08T00:01:46Z

src/ray/rpc/tests/rpc_chaos_test.cc

+  for (int i = 0; i < 100; i++) {
+    ASSERT_EQ(GetRpcFailure("method2"), RpcFailure::Request);
+  }
+}


The new test LowerBoundWithWildcard only covers the case where num_failures is set to -1 (unlimited). It would be beneficial to add another test case to cover scenarios where a positive, finite number of failures is specified. This would have caught the logic bug in GetFailureTypeFromFailable.

Please consider adding a test case like the following to ensure the logic is correct for all configurations.

TEST(RpcChaosTest, LowerBoundWithLimitedFailures) { RayConfig::instance().testing_rpc_failure() = "*=1:0:0:2:0"; Init(); ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request); // This assertion will fail with the current implementation because it will return `None`, // breaking the guarantee of 2 request failures. This test case helps verify the fix for // the logic when a finite number of failures is specified. ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request); }

Signed-off-by: joshlee <[email protected]>

Sparks0219 · 2025-10-08T22:46:06Z

I don't think it makes sense to use the wildcard and set a non -1 value for the total amount of rpc failures tbh, but not sure if its worth enforcing.

src/ray/common/ray_config_def.h

src/ray/rpc/rpc_chaos.cc

dayshah · 2025-10-10T05:59:23Z

src/ray/rpc/rpc_chaos.cc

  };
  absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_);
+  absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_);
+  absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_);


why can't these just be contained in the failable struct?

it would break for the wildcard, since the wildcard we pass in the * and not the actual name of the rpc like for the others

hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.

dayshah · 2025-10-10T06:01:07Z

I don't think it makes sense to use the wildcard and set a non -1 value for the total amount of rpc failures tbh, but not sure if its worth enforcing.

shouldn't enforce imo

edoakes · 2025-10-10T16:15:37Z

@Sparks0219 what're you planning to use this for?

dayshah · 2025-10-10T23:14:48Z

@Sparks0219 what're you planning to use this for?

we're scared that the chaos tests won't test all the rpc's with 20% failure rates. With this we can guarantee that they all fail a req and reply during a real workload at least once

Sparks0219 · 2025-10-10T23:19:00Z

@Sparks0219 what're you planning to use this for?

Primarily for #57579 we wanted to be extra safe that each RPC was actually being tested at least once

Signed-off-by: joshlee <[email protected]>

dayshah

I don't love the idea of making this already complicated field more complicated, but we need it so 😔

dayshah · 2025-10-12T19:30:23Z

src/ray/rpc/rpc_chaos.cc

  };
  absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_);
+  absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_);
+  absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_);


hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.

edoakes · 2025-10-13T13:30:47Z

I don't love the idea of making this already complicated field more complicated, but we need it so

I almost wrote this verbatim :)

…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]>

…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: xgui <[email protected]>

…c chaos… (#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: elliot-barn <[email protected]>

Guarantee min amount of failures for request/response using rpc chaos…

d29b0fd

… config Signed-off-by: joshlee <[email protected]>

Sparks0219 assigned jjyao and dayshah Oct 7, 2025

Sparks0219 requested a review from a team as a code owner October 7, 2025 23:59

Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 7, 2025

gemini-code-assist bot reviewed Oct 8, 2025

View reviewed changes

This comment was marked as outdated.

Sign in to view

ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 8, 2025

build error

e31b41b

Signed-off-by: joshlee <[email protected]>

This comment was marked as outdated.

Sign in to view

dayshah reviewed Oct 10, 2025

View reviewed changes

Addressing comments

a6e7402

Signed-off-by: joshlee <[email protected]>

dayshah approved these changes Oct 12, 2025

View reviewed changes

dayshah merged commit d4e4956 into ray-project:master Oct 14, 2025
6 checks passed

justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025

[core] Guarantee min amount of failures for request/response using rp…

70451d2

…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]>

xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025

[core] Guarantee min amount of failures for request/response using rp…

6fb938d

…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: xgui <[email protected]>

elliot-barn pushed a commit that referenced this pull request Oct 23, 2025

[core] Guarantee min amount of failures for request/response using rp…

8828354

…c chaos… (#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: elliot-barn <[email protected]>

[core] Guarantee min amount of failures for request/response using rpc chaos… #57288

[core] Guarantee min amount of failures for request/response using rpc chaos… #57288

Uh oh!

Conversation

Sparks0219 commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

Sparks0219 Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

Sparks0219 commented Oct 8, 2025

Uh oh!

Uh oh!

Uh oh!

dayshah Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

Sparks0219 Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah commented Oct 10, 2025

Uh oh!

edoakes commented Oct 10, 2025

Uh oh!

dayshah commented Oct 10, 2025

Uh oh!

Sparks0219 commented Oct 10, 2025

Uh oh!

dayshah left a comment

Choose a reason for hiding this comment

Uh oh!

dayshah Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

edoakes commented Oct 13, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sparks0219 commented Oct 7, 2025 •

edited

Loading