Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow read only activation with less than three downstairs #1608

Merged
merged 8 commits into from
Jan 24, 2025

Conversation

leftwo
Copy link
Contributor

@leftwo leftwo commented Jan 16, 2025

Allow a read only upstairs to activate with one a single downstairs present.

In upstairs/src/upstairs.rs, I've added a check when a downstairs transitions to WaitQuorum.
If we are read-only, then we can skip reconciliation and activate the upstairs.
If we are already active (and read only), then a new downstairs can go to active.

Added some tests and a bit of additional test framework to verify
an upstairs can activate with only a single downstairs ready.

There are a few places where I have XXX question comments that I'm not sure about the answers.

This "fixes" the feature request in #1599
and may help with #1593

Allow a read only upstairs to activate with one a single downstairs
present.

Added some tests and a bit of additional test framework to verify
an upstairs can activate with only a single downstairs ready.
tools/test_read_only.sh Outdated Show resolved Hide resolved
exit 1
}

echo "Creating $region_count downstairs regions"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be shorter to create + start the regions in a single dsc call, using dsc start --create

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is shorter, yes. And, that's how we did it originally in these tests. But we ran into some issues where the creation would fail or take long enough (when creating large regions) that the test would fail. If we create first, and wait for dsc to exit we know that the creation was successful. I just took that create first then start model and applied it everywhere we have dsc.

However, this was all before dsc got smarter and can now tell us when everything is ready, so perhaps it's time to think about a one step DSC.

It's also probably time to make some "common" area that all these scripts can use, there is a bunch of copied code that is basically the same in every test.

tools/test_read_only.sh Outdated Show resolved Hide resolved
upstairs/src/upstairs.rs Outdated Show resolved Hide resolved
upstairs/src/upstairs.rs Outdated Show resolved Hide resolved
@@ -2885,6 +2885,85 @@ async fn test_no_send_offline() {
}
}

#[tokio::test]
async fn test_ro_activate_with_two() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe overkill, but should there be a test_ro_activate_with_one as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 792c252 I added a with_one and updated with_two so we check all possible combinations.

Comment on lines 857 to 858
// XXX ConnectioMode::New only valid for read only upstairs, can
// we check this here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? Something like

            DsState::Connecting {
                mode: ConnectionMode::Faulted | ConnectionMode::Replaced,
                ..
            }
            | DsState::Stopping(
                ClientStopReason::Fault(..)
                | ClientStopReason::Disabled
                | ClientStopReason::Replacing
                | ClientStopReason::NegotiationFailed(..),
            ) if !self.cfg.read_only => EnqueueResult::Skip,
            
            DsState::Connecting {
                mode:
                    ConnectionMode::Faulted
                    | ConnectionMode::Replaced
                    | ConnectionMode::New,
                ..
            }
            | DsState::Stopping(
                ClientStopReason::Fault(..)
                | ClientStopReason::Disabled
                | ClientStopReason::Replacing
                | ClientStopReason::NegotiationFailed(..),
            ) if self.cfg.read_only => EnqueueResult::Skip,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried something, see what you think

mkeeter added a commit that referenced this pull request Jan 21, 2025
Right now, the Crucible upstairs has a timer which periodically checks
whether live-repair should start. Maintaining this timer is tricky: we
set it to `None` if live-repair isn't possible, or `Some(t)` if
live-repair _is_ possible. We have to manually keep the timer in sync
with the rest of the system's state.

In #1608, @leftwo
expresses concern about whether he's maintaining the timer correctly;
it's certainly not obvious.

This PR removes the timer, moving checking for live-repair into the
"invariant maintenance" category of work. In other words, every time
`Upstairs::apply` is called, it will also check whether live-repair
needs to start. This is a cheap check, and doing it every time means we
don't have to think about keeping the timer correctly enabled /
disabled.

There's one edge case of behavior change: we will no longer perform
live-repair on two Downstairs simultaneously. One of them will
necessarily enter `LiveRepairReady` first, since events are handled in a
single task; it will then immediately enter `LiveRepair` in the same
call to `Upstairs::apply`.

However, that was exceedingly unlikely in the existing code!

Right now, when we reach `LiveRepairReady` during negotiation, we set
`repair_check_deadline` to `Instant::now()`. This means that the next call to
`Upstairs::select` will almost certainly choose `UpstairsAction::RepairCheck`
and begin live-repair of just that Downstairs.

The only way we'd repair two Downstairs simultaneously is if we get very lucky
and they (1) complete negotiation at exactly the same time and (2) the second
negotiation message wins the race with
`tokio::time::sleep_until(Instant::now())`
@@ -896,7 +895,7 @@ impl DownstairsClient {
// client back into compliance by replaying jobs).
DsState::Active => EnqueueResult::Send,
DsState::Connecting {
mode: ConnectionMode::Offline,
mode: ConnectionMode::Offline | ConnectionMode::New, // RO client checked above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is saying that if we're in this match, it's a read-write config right? Because we'd have executed the first match if it was read-only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly.
I wanted to help if someone noticed that ConnectionMode::New appeared in two places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a behavior change here for the read-write case.

Previously, it would panic if you tried to send a message to a downstairs in Connecting { mode: New, .. }. Now, it will either return Skip (if read-only) or Hold (if read-write). I can believe that the read-only behavior change is intended; is the read-write behavior change also intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the read-write behavior change also intentional?

No!

I put this back in the wrong place, let me fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f212546

DsState::Connecting {
mode: ConnectionMode::New,
..
} if self.cfg.read_only => EnqueueResult::Skip,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I feel like it should bail out earlier, but am not 100% confident

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've convinced myself that this makes sense. You should put a comment saying that we can be connected in read-only mode with only a single Downstairs, so the other Downstairs should skip jobs in that case.

It seems like we could also pass in UpstairsState to distinguish here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added in f212546

@leftwo leftwo merged commit ace54d9 into main Jan 24, 2025
17 checks passed
@leftwo leftwo deleted the alan/readonly-activate-with-less branch January 24, 2025 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants