-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core] Guarantee min amount of failures for request/response using rpc chaos… #57288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Guarantee min amount of failures for request/response using rpc chaos… #57288
Conversation
… config Signed-off-by: joshlee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request introduces a mechanism to guarantee a minimum number of request and response failures for RPC chaos testing. This is a valuable addition for ensuring test coverage of failure paths for less frequently used RPCs. The implementation correctly parses the new configuration and adds logic to enforce the guaranteed failures. However, there is a critical logic issue when a finite number of total failures is specified, which can cause the failure guarantee to be violated. Additionally, the new logic is not fully covered by tests. My review includes a critical comment on the logic bug and a suggestion to add a test case to cover this scenario.
src/ray/rpc/rpc_chaos.cc
Outdated
| RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) { | ||
| if (failable.num_remaining_failures == 0) { | ||
| // If < 0, unlimited failures. | ||
| return RpcFailure::None; | ||
| } | ||
|
|
||
| if (num_req_failures_[name] < failable.num_lower_bound_req_failures) { | ||
| failable.num_remaining_failures--; | ||
| num_req_failures_[name]++; | ||
| return RpcFailure::Request; | ||
| } | ||
| if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) { | ||
| failable.num_remaining_failures--; | ||
| num_resp_failures_[name]++; | ||
| return RpcFailure::Response; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a critical logic bug in GetFailureTypeFromFailable that can break the guarantee of minimum failures when a positive num_remaining_failures is used.
The check if (failable.num_remaining_failures == 0) at the beginning of the function will cause an early exit if the total number of failures is exhausted before all guaranteed failures have been injected.
For example, with a configuration like *="1:100:0:2:0", we expect 2 guaranteed request failures:
- The first call will inject a request failure, and
num_remaining_failureswill be decremented from 1 to 0. - The second call will find
num_remaining_failuresis 0 and returnRpcFailure::Noneimmediately, failing to inject the second guaranteed request failure.
To fix this, the check for num_remaining_failures should be performed right before each failure injection (both guaranteed and probabilistic), not once at the top of the function. The logic for decrementing num_remaining_failures should also only apply if it's positive.
A conceptual refactoring would look like this:
RpcFailure GetFailureTypeFromFailable(Failable &failable, const std::string &name) {
// First, handle guaranteed request failures.
if (num_req_failures_[name] < failable.num_lower_bound_req_failures) {
if (failable.num_remaining_failures == 0) {
return RpcFailure::None;
}
if (failable.num_remaining_failures > 0) {
failable.num_remaining_failures--;
}
num_req_failures_[name]++;
return RpcFailure::Request;
}
// Then, handle guaranteed response failures.
if (num_resp_failures_[name] < failable.num_lower_bound_resp_failures) {
if (failable.num_remaining_failures == 0) {
return RpcFailure::None;
}
if (failable.num_remaining_failures > 0) {
failable.num_remaining_failures--;
}
num_resp_failures_[name]++;
return RpcFailure::Response;
}
// Finally, handle probabilistic failures.
if (failable.num_remaining_failures == 0) {
return RpcFailure::None;
}
std::uniform_int_distribution<size_t> dist(1ul, 100ul);
const size_t random_number = dist(gen_);
if (random_number <= failable.req_failure_prob) {
if (failable.num_remaining_failures > 0) {
failable.num_remaining_failures--;
}
return RpcFailure::Request;
}
if (random_number <= failable.req_failure_prob + failable.resp_failure_prob) {
if (failable.num_remaining_failures > 0) {
failable.num_remaining_failures--;
}
return RpcFailure::Response;
}
return RpcFailure::None;
}Since this change affects lines outside of the current diff, I'm providing it here for your consideration. The key is to move the num_remaining_failures check and conditional decrement into each failure-injection path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhh i think it maybe has a point here 👁️ 👁️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I'm not entirely sure about this, since if the guaranteed failures aren't part of the total failures then it's not really the total failures. But when you set the lower bound and start using multiple RPCs, then it's inaccurate anyway. I guess we could go with this, and rename the first parameter from total failures to like probabilistic failures? Hence first parameter will always exclude the parameters caused by the 4/5 parameter.
| for (int i = 0; i < 100; i++) { | ||
| ASSERT_EQ(GetRpcFailure("method2"), RpcFailure::Request); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test LowerBoundWithWildcard only covers the case where num_failures is set to -1 (unlimited). It would be beneficial to add another test case to cover scenarios where a positive, finite number of failures is specified. This would have caught the logic bug in GetFailureTypeFromFailable.
Please consider adding a test case like the following to ensure the logic is correct for all configurations.
TEST(RpcChaosTest, LowerBoundWithLimitedFailures) {
RayConfig::instance().testing_rpc_failure() = "*=1:0:0:2:0";
Init();
ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request);
// This assertion will fail with the current implementation because it will return `None`,
// breaking the guarantee of 2 request failures. This test case helps verify the fix for
// the logic when a finite number of failures is specified.
ASSERT_EQ(GetRpcFailure("method1"), RpcFailure::Request);
}Signed-off-by: joshlee <[email protected]>
|
I don't think it makes sense to use the wildcard and set a non -1 value for the total amount of rpc failures tbh, but not sure if its worth enforcing. |
| }; | ||
| absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_); | ||
| absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_); | ||
| absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't these just be contained in the failable struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would break for the wildcard, since the wildcard we pass in the * and not the actual name of the rpc like for the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.
shouldn't enforce imo |
|
@Sparks0219 what're you planning to use this for? |
we're scared that the chaos tests won't test all the rpc's with 20% failure rates. With this we can guarantee that they all fail a req and reply during a real workload at least once |
Primarily for #57579 we wanted to be extra safe that each RPC was actually being tested at least once |
Signed-off-by: joshlee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the idea of making this already complicated field more complicated, but we need it so 😔
| }; | ||
| absl::flat_hash_map<std::string, Failable> failable_methods_ ABSL_GUARDED_BY(&mu_); | ||
| absl::flat_hash_map<std::string, size_t> num_req_failures_ ABSL_GUARDED_BY(&mu_); | ||
| absl::flat_hash_map<std::string, size_t> num_resp_failures_ ABSL_GUARDED_BY(&mu_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ya the logic could be a little messier, like you pass 2 failables into GetFailureTypeFromFailable, one for wildcard and one for the actual name. Maybe the wildcard failable should live outside the map, and then GetFailureTypeFromFailable checks if wildcard set and gets uses the wildcard failable and the actual named one. Maybe overthinking this, I'm ok with the current state.
I almost wrote this verbatim :) |
…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]>
…c chaos… (ray-project#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: xgui <[email protected]>
…c chaos… (#57288) Signed-off-by: joshlee <[email protected]> Signed-off-by: elliot-barn <[email protected]>
Why are these changes needed?
For cases where we use the wildcard and have a wide variety of RPCs, often the amount of times a couple RPCs will be called will be vastly greater than the others. Using the probability based config means its highly unlikely these less used RPCs will be tested. Added an additional 4/5 config to guarantee a lower bound of failures for EACH individual RPC on request and response.
Related issue number
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.