Sled expunge on london resulted in unfinished downstairs replacements. #1591
Some pantry logs for sled 14 are at:
Starting at the top replacement on the list
So, it's on a pantry, but which sled? To find it, we look for a sled that shares the
Sled 14, great.
So, the region:
And the disk's info, which goes deep (as this is a disk based off of a 35-deep snapshot tree):
The pantry logs from sled 14 for just this session (5bd29544) show that this pantry was able to reconcile all three downstairs and that everything should be done (logs for this session are at pantry-log-just-session-5bd29544-from-1734988498):
So, for this first region replacement (06cd17c3) item, the pantry does think it's done.
That the three downstairs are all still active tells me that either the pantry has not told Nexus that the "repair" is done, or that Nexus has not asked the pantry to confirm that it has completed.
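To make the two possibilities concrete, here is a minimal sketch of the confirmation loop being described. The names (`PantryJob`, `poll_repair`) are hypothetical, not the real Nexus background task or pantry client API; the sketch only illustrates that either the pantry never reports the job as finished, or nothing on the Nexus side ever polls for that result.

```rust
/// Hypothetical, simplified model of the pantry/Nexus handshake described
/// above. These are illustrative names, not the real omicron or
/// crucible-pantry API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Running,
    Finished { ok: bool },
}

struct PantryJob {
    status: JobStatus,
}

impl PantryJob {
    /// What the pantry knows: whether the repair finished on its side.
    fn status(&self) -> JobStatus {
        self.status
    }
}

/// What the Nexus side has to do: keep asking until the pantry says the job
/// is finished, then record the result. If this poll never runs (or never
/// reaches the pantry), the repair stays "unfinished" from Nexus's point of
/// view even though all three downstairs are Active.
fn poll_repair(job: &PantryJob) -> Option<bool> {
    match job.status() {
        JobStatus::Running => None, // check again later
        JobStatus::Finished { ok } => Some(ok),
    }
}

fn main() {
    let job = PantryJob { status: JobStatus::Finished { ok: true } };
    // The pantry thinks it's done; the question in this issue is whether
    // anything on the Nexus side ever observes this.
    assert_eq!(poll_repair(&job), Some(true));
    println!("pantry reports finished: {:?}", poll_repair(&job));
}
```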
Looking at the omdb db sagas output for sagas that reference
Then, another like the one above, then this one that looks a little different:
Then another action failed:
Then, a bunch more of those, which appear to be repeating every 10 seconds. The last one I have in the log still looks the same as the others, with that same
I tried expunging disks one at a time while also checking which omicron services had their zone datasets on the disk being expunged, and I ran blueprint updates (i.e., regenerate, target set, etc.) after every disk expunge. I hit a case of unfinished region replacement when the physical disk hosted the nexus dataset and that particular nexus was handling the region replacement saga at the time it was expunged. I don't know if this would explain why the replacement couldn't finish.

Here are some key events and related data. These are the region replacement jobs triggered by the disk expungement; one of them failed to complete:
The associated pantry started to return connection timeout errors at
That is around the time when the nexus zone was being expunged:
The last few sagas processed by the reaped nexus zone were for the replacement id
After the reconfigurator replaced the expunged nexus with a new nexus instance, the pantry didn't recover from the connection timeout error. I tried bouncing the pantry SMF service, and it didn't eliminate the error. I wonder if the timeout is due to a step that's stuck indefinitely because the nexus went away.
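A rough sketch of the lifecycle being described, under stated assumptions: only the Allocating and Running state names are taken from this thread, and the other variants and transitions are illustrative rather than the actual omicron region-replacement schema. The point is that each forward transition is driven by a saga running on a particular Nexus, so a record can sit in its current state indefinitely if that Nexus disappears mid-saga.

```rust
/// Simplified, assumed sketch of a region-replacement lifecycle; not the
/// real omicron state machine.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplacementState {
    Requested,
    Allocating,
    Running,
    Done,
}

/// Each forward transition is driven by a saga executing on some Nexus.
/// If that Nexus zone is expunged mid-saga, the record simply stays in its
/// current state until another Nexus picks the work back up.
fn step(state: ReplacementState, driving_nexus_alive: bool) -> ReplacementState {
    if !driving_nexus_alive {
        return state; // stuck: nothing advances the state machine
    }
    match state {
        ReplacementState::Requested => ReplacementState::Allocating,
        ReplacementState::Allocating => ReplacementState::Running,
        ReplacementState::Running => ReplacementState::Done,
        ReplacementState::Done => ReplacementState::Done,
    }
}

fn main() {
    let mut state = ReplacementState::Requested;
    // Nexus drives the replacement forward...
    state = step(state, true);
    state = step(state, true);
    // ...and is then expunged while the replacement is Running.
    state = step(state, false);
    assert_eq!(state, ReplacementState::Running);
    println!("replacement left in {:?}", state);
}
```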
The nexus dependency is a red herring, or a different issue. I retried sled expungement on london with a sled that has neither a nexus nor a pantry zone, but still got some region and snapshot replacement jobs stuck in the running or allocating states. So far, these problem replacements are all related to detached disks.
Yeah, I did not mention it in my initial report, but I had first added a sled and then expunged it, so it only had Crucible and NTP zones.
So, the above disk info actually points to one issue we have here that will prevent repair from making forward progress. However, we don't have a simple disk in this case: we have a VCR with a RW sub-volume and a tree of read-only parents. From above:
How the pantry (or propolis) works with a multi-level VCR is that it spins up an upstairs instance for each sub-volume (of which there is currently only ever one) and one upstairs for each level of the read-only parent. In order for the pantry to consider a repair completed, all of the upstairs instances have to make it through activation. In the VCR received by the pantry, the pantry does not know which sub-volume or read-only parent level is the one that we want to repair, so it activates (or tries to activate) all of them.

And here is the problem: while the upstairs instance that needed to do a repair will do so and activate, there are other parts of this VCR that also need to be repaired (in the example above, any layer that still has a target on the expunged sled). This also explains the dtrace output that found an upstairs instance with two WQ and one NEW:
This would be an upstairs instance that was still trying to contact a downstairs on an expunged sled.
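A minimal sketch of the completion rule described above, with an invented layer layout (not the real crucible VolumeConstructionRequest types): the pantry treats the repair as done only when every layer's upstairs has activated, so a single read-only parent still pointing at the expunged sled keeps the whole repair open.

```rust
/// Simplified model of the structure described above: one read/write
/// sub-volume plus a chain of read-only parents, each backed by its own
/// upstairs instance. Names and shapes are illustrative only.
#[derive(Debug)]
struct UpstairsLayer {
    name: &'static str,
    activated: bool,
}

/// The rule the pantry applies, as described in the comment: the repair
/// counts as complete only when *every* layer's upstairs has activated.
fn repair_complete(layers: &[UpstairsLayer]) -> bool {
    layers.iter().all(|l| l.activated)
}

fn main() {
    let layers = vec![
        UpstairsLayer { name: "rw-sub-volume", activated: true },
        UpstairsLayer { name: "read-only-parent-1", activated: true },
        // This parent still has a target on the expunged sled, so its
        // upstairs never finishes activation.
        UpstairsLayer { name: "read-only-parent-2", activated: false },
    ];
    for l in layers.iter().filter(|l| !l.activated) {
        println!("layer {} has not activated", l.name);
    }
    assert!(!repair_complete(&layers));
    println!("repair complete: {}", repair_complete(&layers));
}
```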
I expunged a sled on london with a bunch of disks and snapshots on that sled and let the repairs commence.
I am running omicron build:
ef65ae6ae9bc9e8fca68ff2b3cc69765f0e75b6b
which landed Friday, Dec 20th.
A bunch of repairs completed, but many seem to be stuck:
Looking further, I see a bunch of crucible_pantry processes that have downstairs in WaitQuorum. I don't know yet if this is the problem, but I wanted to record this before I lose state on it.
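For reference, a simplified sketch of the per-downstairs states named in this issue (New and WaitQuorum; crucible's real state set is larger), showing why an upstairs with one downstairs still stuck in New never finishes activation. The state names here are taken from the thread; the activation condition is an assumption for illustration, not the exact crucible logic.

```rust
/// Simplified sketch of per-downstairs connection states; crucible's real
/// state set is larger, this only models the condition described here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DsState {
    New,        // still trying to connect (e.g. target on an expunged sled)
    WaitQuorum, // connected, waiting on its peers before activation
    Active,
}

/// The upstairs can't finish activation while any downstairs is still
/// trying to connect, which matches the stuck pantries seen here.
fn can_activate(downstairs: &[DsState; 3]) -> bool {
    downstairs
        .iter()
        .all(|&ds| matches!(ds, DsState::WaitQuorum | DsState::Active))
}

fn main() {
    let stuck = [DsState::WaitQuorum, DsState::WaitQuorum, DsState::New];
    assert!(!can_activate(&stuck));

    let healthy = [DsState::Active, DsState::Active, DsState::Active];
    assert!(can_activate(&healthy));

    println!("stuck upstairs can activate: {}", can_activate(&stuck));
}
```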
Here is some upstairs state on sled 14:
And, more info from 14: