Skip to content

Conversation

@Sparks0219
Copy link
Contributor

@Sparks0219 Sparks0219 commented Oct 7, 2025

Why are these changes needed?

For cases where we use the wildcard and have a wide variety of RPCs, often the amount of times a couple RPCs will be called will be vastly greater than the others. Using the probability based config means its highly unlikely these less used RPCs will be tested. Added an additional 4/5 config to guarantee a lower bound of failures for EACH individual RPC on request and response.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Sparks0219 Sparks0219 requested a review from a team as a code owner October 7, 2025 23:59
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 7, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces a mechanism to guarantee a minimum number of request and response failures for RPC chaos testing. This is a valuable addition for ensuring test coverage of failure paths for less frequently used RPCs. The implementation correctly parses the new configuration and adds logic to enforce the guaranteed failures. However, there is a critical logic issue when a finite number of total failures is specified, which can cause the failure guarantee to be violated. Additionally, the new logic is not fully covered by tests. My review includes a critical comment on the logic bug and a suggestion to add a test case to cover this scenario.

Comment on lines 142 to 157
RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) {
if (failable.num_remaining_failures == 0) {
// If < 0, unlimited failures.
return RpcFailure::None;
}

if (num_req_failures_[name] < failable.num_lower_bound_req_failures) {
failable.num_remaining_failures--;
num_req_failures_[name]++;
return RpcFailure::Request;
}
if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) {
failable.num_remaining_failures--;
num_resp_failures_[name]++;
return RpcFailure::Response;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical logic bug in GetFailureTypeFromFailable that can break the guarantee of minimum failures when a positive num_remaining_failures is used.

The check if (failable.num_remaining_failures == 0) at the beginning of the function will cause an early exit if the total number of failures is exhausted before all guaranteed failures have been injected.

For example, with a configuration like *="1:100:0:2:0", we expect 2 guaranteed request failures:

  1. The first call will inject a request failure, and num_remaining_failures will be decremented from 1 to 0.
  2. The second call will find num_remaining_failures is 0 and return RpcFailure::None immediately, failing to inject the second guaranteed request failure.

To fix this, the check for num_remaining_failures should be performed right before each failure injection (both guaranteed and probabilistic), not once at the top of the function. The logic for decrementing num_remaining_failures should also only apply if it's positive.

A conceptual refactoring would look like this:

RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) {
  // First, handle guaranteed request failures.
  if (num_req_failures_[name] < failable.num_lower_bound_req_failures) {
    if (failable.num_remaining_failures == 0) {
      return RpcFailure::None;
    }
    if (failable.num_remaining_failures > 0) {
      failable.num_remaining_failures--;
    }
    num_req_failures_[name]++;
    return RpcFailure::Request;
  }

  // Then, handle guaranteed response failures.
  if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) {
    if (failable.num_remaining_failures == 0) {
      return RpcFailure::None;
    }
    if (failable.num_remaining_failures > 0) {
      failable.num_remaining_failures--;
    }
    num_resp_failures_[name]++;
    return RpcFailure::Response;
  }

  // Finally, handle probabilistic failures.
  if (failable.num_remaining_failures == 0) {
    return RpcFailure::None;
  }
  
  std::uniform_int_distribution<size_t> dist(1ul, 100ul);
  const size_t random_number = dist(gen_);
  if (random_number <= failable.req_failure_prob) {
    if (failable.num_remaining_failures > 0) {
      failable.num_remaining_failures--;
    }
    return RpcFailure::Request;
  }
  if (random_number <= failable.req_failure_prob + failable.resp_failure_prob) {
    if (failable.num_remaining_failures > 0) {
      failable.num_remaining_failures--;
    }
    return RpcFailure::Response;
  }

  return RpcFailure::None;
}

Since this change affects lines outside of the current diff, I'm providing it here for your consideration. The key is to move the num_remaining_failures check and conditional decrement into each failure-injection path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhh i think it maybe has a point here 👁️ 👁️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I'm not entirely sure about this, since if the guaranteed failures aren't part of the total failures then it's not really the total failures. But when you set the lower bound and start using multiple RPCs, then it's inaccurate anyway. I guess we could go with this, and rename the first parameter from total failures to like probabilistic failures? Hence first parameter will always exclude the parameters caused by the 4/5 parameter.

for (int i = 0; i < 100; i++) {
ASSERT_EQ(GetRpcFailure("method2"), RpcFailure::Request);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The new test LowerBoundWithWildcard only covers the case where num_failures is set to -1 (unlimited). It would be beneficial to add another test case to cover scenarios where a positive, finite number of failures is specified. This would have caught the logic bug in GetFailureTypeFromFailable.

Please consider adding a test case like the following to ensure the logic is correct for all configurations.

TEST(RpcChaosTest, LowerBoundWithLimitedFailures) {
  RayConfig::instance().testing_rpc_failure() = "*=1:0:0:2:0";
  Init();

  ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request);
  // This assertion will fail with the current implementation because it will return `None`,
  // breaking the guarantee of 2 request failures. This test case helps verify the fix for
  // the logic when a finite number of failures is specified.
  ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request);
}

cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 8, 2025
Signed-off-by: joshlee <[email protected]>
cursor[bot]

This comment was marked as outdated.

@Sparks0219
Copy link
Contributor Author

I don't think it makes sense to use the wildcard and set a non -1 value for the total amount of rpc failures tbh, but not sure if its worth enforcing.

};
absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_);
absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_);
absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't these just be contained in the failable struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would break for the wildcard, since the wildcard we pass in the * and not the actual name of the rpc like for the others

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.

@dayshah
Copy link
Contributor

dayshah commented Oct 10, 2025

I don't think it makes sense to use the wildcard and set a non -1 value for the total amount of rpc failures tbh, but not sure if its worth enforcing.

shouldn't enforce imo

@edoakes
Copy link
Collaborator

edoakes commented Oct 10, 2025

@Sparks0219 what're you planning to use this for?

@dayshah
Copy link
Contributor

dayshah commented Oct 10, 2025

@Sparks0219 what're you planning to use this for?

we're scared that the chaos tests won't test all the rpc's with 20% failure rates. With this we can guarantee that they all fail a req and reply during a real workload at least once

@Sparks0219
Copy link
Contributor Author

@Sparks0219 what're you planning to use this for?

Primarily for #57579 we wanted to be extra safe that each RPC was actually being tested at least once

Signed-off-by: joshlee <[email protected]>
Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the idea of making this already complicated field more complicated, but we need it so 😔

};
absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_);
absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_);
absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.

@edoakes
Copy link
Collaborator

edoakes commented Oct 13, 2025

I don't love the idea of making this already complicated field more complicated, but we need it so

I almost wrote this verbatim :)

@dayshah dayshah merged commit d4e4956 into ray-project:master Oct 14, 2025
6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants