-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow read only activation with less than three downstairs #1608
Conversation
Allow a read only upstairs to activate with one a single downstairs present. Added some tests and a bit of additional test framework to verify an upstairs can activate with only a single downstairs ready.
exit 1 | ||
} | ||
|
||
echo "Creating $region_count downstairs regions" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be shorter to create + start the regions in a single dsc
call, using dsc start --create
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is shorter, yes. And, that's how we did it originally in these tests. But we ran into some issues where the creation would fail or take long enough (when creating large regions) that the test would fail. If we create first, and wait for dsc to exit we know that the creation was successful. I just took that create first then start model and applied it everywhere we have dsc.
However, this was all before dsc got smarter and can now tell us when everything is ready, so perhaps it's time to think about a one step DSC.
It's also probably time to make some "common" area that all these scripts can use, there is a bunch of copied code that is basically the same in every test.
@@ -2885,6 +2885,85 @@ async fn test_no_send_offline() { | |||
} | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_ro_activate_with_two() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe overkill, but should there be a test_ro_activate_with_one
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 792c252 I added a with_one
and updated with_two
so we check all possible combinations.
upstairs/src/client.rs
Outdated
// XXX ConnectioMode::New only valid for read only upstairs, can | ||
// we check this here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? Something like
DsState::Connecting {
mode: ConnectionMode::Faulted | ConnectionMode::Replaced,
..
}
| DsState::Stopping(
ClientStopReason::Fault(..)
| ClientStopReason::Disabled
| ClientStopReason::Replacing
| ClientStopReason::NegotiationFailed(..),
) if !self.cfg.read_only => EnqueueResult::Skip,
DsState::Connecting {
mode:
ConnectionMode::Faulted
| ConnectionMode::Replaced
| ConnectionMode::New,
..
}
| DsState::Stopping(
ClientStopReason::Fault(..)
| ClientStopReason::Disabled
| ClientStopReason::Replacing
| ClientStopReason::NegotiationFailed(..),
) if self.cfg.read_only => EnqueueResult::Skip,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried something, see what you think
Right now, the Crucible upstairs has a timer which periodically checks whether live-repair should start. Maintaining this timer is tricky: we set it to `None` if live-repair isn't possible, or `Some(t)` if live-repair _is_ possible. We have to manually keep the timer in sync with the rest of the system's state. In #1608, @leftwo expresses concern about whether he's maintaining the timer correctly; it's certainly not obvious. This PR removes the timer, moving checking for live-repair into the "invariant maintenance" category of work. In other words, every time `Upstairs::apply` is called, it will also check whether live-repair needs to start. This is a cheap check, and doing it every time means we don't have to think about keeping the timer correctly enabled / disabled. There's one edge case of behavior change: we will no longer perform live-repair on two Downstairs simultaneously. One of them will necessarily enter `LiveRepairReady` first, since events are handled in a single task; it will then immediately enter `LiveRepair` in the same call to `Upstairs::apply`. However, that was exceedingly unlikely in the existing code! Right now, when we reach `LiveRepairReady` during negotiation, we set `repair_check_deadline` to `Instant::now()`. This means that the next call to `Upstairs::select` will almost certainly choose `UpstairsAction::RepairCheck` and begin live-repair of just that Downstairs. The only way we'd repair two Downstairs simultaneously is if we get very lucky and they (1) complete negotiation at exactly the same time and (2) the second negotiation message wins the race with `tokio::time::sleep_until(Instant::now())`
upstairs/src/client.rs
Outdated
@@ -896,7 +895,7 @@ impl DownstairsClient { | |||
// client back into compliance by replaying jobs). | |||
DsState::Active => EnqueueResult::Send, | |||
DsState::Connecting { | |||
mode: ConnectionMode::Offline, | |||
mode: ConnectionMode::Offline | ConnectionMode::New, // RO client checked above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is saying that if we're in this match, it's a read-write config right? Because we'd have executed the first match if it was read-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly.
I wanted to help if someone noticed that ConnectionMode::New
appeared in two places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a behavior change here for the read-write case.
Previously, it would panic if you tried to send a message to a downstairs in Connecting { mode: New, .. }
. Now, it will either return Skip
(if read-only) or Hold
(if read-write). I can believe that the read-only behavior change is intended; is the read-write behavior change also intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the read-write behavior change also intentional?
No!
I put this back in the wrong place, let me fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in f212546
upstairs/src/client.rs
Outdated
DsState::Connecting { | ||
mode: ConnectionMode::New, | ||
.. | ||
} if self.cfg.read_only => EnqueueResult::Skip, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? I feel like it should bail out earlier, but am not 100% confident
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've convinced myself that this makes sense. You should put a comment saying that we can be connected in read-only mode with only a single Downstairs, so the other Downstairs should skip jobs in that case.
It seems like we could also pass in UpstairsState
to distinguish here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added in f212546
Allow a read only upstairs to activate with one a single downstairs present.
In upstairs/src/upstairs.rs, I've added a check when a downstairs transitions to
WaitQuorum
.If we are read-only, then we can skip reconciliation and activate the upstairs.
If we are already active (and read only), then a new downstairs can go to active.
Added some tests and a bit of additional test framework to verify
an upstairs can activate with only a single downstairs ready.
There are a few places where I have XXX question comments that I'm not sure about the answers.
This "fixes" the feature request in #1599
and may help with #1593