Sled expunge on london resulted in unfinished downstairs replacements. #1591

Open
leftwo opened this issue Dec 24, 2024 · 7 comments
@leftwo
Contributor

leftwo commented Dec 24, 2024

I expunged a sled on london with a bunch of disks and snapshots on that sled and let the repairs commence.
I am running omicron build: ef65ae6ae9bc9e8fca68ff2b3cc69765f0e75b6b which landed Friday Dec 20th

A bunch of repairs completed, but many seem to be stuck:

root@oxz_switch1:~#  echo SNAPSHOTS; omdb db region-snapshot-replacement list 2> /dev/null | grep -v Complete ; echo REGIONS; omdb db region-replacement list 2> /dev/null | grep -v Complete 
SNAPSHOTS
Region snapshot replacement requests                                            
ID                                   REQUEST_TIME             REPLACEMENT_STATE 
03b153bc-1fb0-443b-98f4-8ac00fdf825c 2024-12-23T20:49:14.882Z Running           
1e4a3855-1a5d-4537-8134-f589e968d7ff 2024-12-23T20:49:14.745Z Running           
25a7db5a-7344-40ea-80ac-97d520f6e72c 2024-12-23T20:49:14.796Z Running           
4c517d42-8632-4220-9518-b0f98c49ebf5 2024-12-23T20:49:14.921Z Running           
57cd6fb8-9cbd-4d8b-957e-e5b7f4904922 2024-12-23T20:49:15.066Z Running           
5dbbec31-fd86-4254-b9f5-a0a6a6050e4d 2024-12-23T20:49:15.048Z Running           
74bb8bfb-dad6-4884-9e46-8bcc0aa4ccdf 2024-12-23T20:49:14.815Z Running           
a3dc3024-9d51-4f68-8aea-3c98e3bed3a1 2024-12-23T20:49:15.123Z Running           
ad0e02f7-1543-4a6b-a7be-34e8ad2aaee7 2024-12-23T20:49:14.905Z Running           
bffcdd99-3014-4c09-aaf7-edeeb9bf9c79 2024-12-23T20:49:14.864Z Running           
f6189037-1cc9-4120-a7ad-7d42275a12b4 2024-12-23T20:49:14.686Z Running           
f84b53a8-41e7-4d82-9eb6-0b52f2348319 2024-12-23T20:49:14.977Z Running           
REGIONS
Region replacement requests                                                     
ID                                   REQUEST_TIME             REPLACEMENT_STATE 
06cd17c3-1308-4772-9c41-52eed351364f 2024-12-23T20:49:15.015Z Running           
1daaa522-b5b0-4918-a80d-0a840aa813ba 2024-12-23T20:49:14.911Z Running           
29cbffab-de46-420f-9cf6-9fc52dd48121 2024-12-23T20:49:15.542Z Running           
394b09a1-8914-4b3d-8542-4bf9855498b5 2024-12-23T20:49:15.262Z Running           
4fa0e3b3-ad53-40df-af8d-59b5eaf1f587 2024-12-23T20:49:15.620Z Running           
7cf62c6a-f936-416f-a21e-0fbfa59975b2 2024-12-23T20:49:15.677Z Running           
862282ab-243d-43b8-8b94-dfda4b9b80b5 2024-12-23T20:49:15.993Z Running           
9558c13c-af8f-452c-a4dc-7b921b476555 2024-12-23T20:49:14.678Z Running           
a43c9b94-e861-4e31-b5d4-ce800e17da55 2024-12-23T20:49:15.246Z Running           
ab7ba1f2-6d4b-4a5e-aa85-99e9905a9f1e 2024-12-23T20:49:15.093Z Running           
b1e3307d-029f-49be-9ebe-5088b03efb78 2024-12-23T20:49:15.213Z Running           
c55737ed-8a70-4e06-95b4-f1bae6224d34 2024-12-23T20:49:15.193Z Running           
f95abd72-03df-472f-898c-4321d05a7330 2024-12-23T20:49:15.325Z Running           
root@oxz_switch1:~# date
Tue Dec 24 20:05:21 UTC 2024
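
To put a number on "stuck", the listing above can be turned into an age per request. This is only a sketch: it assumes the three-column layout (ID, REQUEST_TIME, REPLACEMENT_STATE) shown in the omdb output, and `running_ages` is a hypothetical helper name, not part of omdb.

```python
from datetime import datetime, timezone

def running_ages(listing, now):
    """Map each non-Complete request ID to its age in hours at `now`."""
    ages = {}
    for line in listing.splitlines():
        parts = line.split()
        # Data rows have exactly three columns: ID, REQUEST_TIME, REPLACEMENT_STATE.
        if len(parts) != 3 or parts[2] == "Complete":
            continue
        try:
            requested = datetime.strptime(parts[1], "%Y-%m-%dT%H:%M:%S.%fZ")
        except ValueError:
            continue  # column header line, not a data row
        requested = requested.replace(tzinfo=timezone.utc)
        ages[parts[0]] = (now - requested).total_seconds() / 3600
    return ages

# Two rows copied from the output above; `now` is the `date` output.
sample = """\
ID                                   REQUEST_TIME             REPLACEMENT_STATE
03b153bc-1fb0-443b-98f4-8ac00fdf825c 2024-12-23T20:49:14.882Z Running
06cd17c3-1308-4772-9c41-52eed351364f 2024-12-23T20:49:15.015Z Running
"""
now = datetime(2024, 12, 24, 20, 5, 21, tzinfo=timezone.utc)
for request_id, hours in running_ages(sample, now).items():
    print(f"{request_id} running for {hours:.1f} h")
```

With the timestamps above, every request in the list has been in Running for a little over 23 hours.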

Looking further, I see a bunch of crucible_pantry processes that have downstairs in WaitQuorum.
I don't know yet if this is the problem, but I wanted to record this before I lose state on it.
Here is some upstairs state on sled 14:

root@oxz_switch1:~# pilot host exec -c 'hostname && /opt/oxide/crucible_dtrace/get-up-state.sh' 14
14  BRM42220036        ok: BRM42220036
  PID  SESSION DS0 DS1 DS2   NEXT_JOB  DELTA CONN   ELR   ELC   ERR   ERN
 1009 3bfb134a NEW NEW NEW      32254      0    3     0     0     0     0
 1009 539dae1e ACT ACT ACT      50342      0    4    74   294     0     0
 1993 5b1977b5 ACT ACT ACT      46649      0    4    74   294     0     0
 1993 f708df24 NEW NEW NEW      32258      0    3     0     0     0     0
 2775 7a340e99 ACT ACT ACT      50434      0    4    74   294     0     0
 2775 a5e3b223 NEW NEW NEW      32396      0    3     0     0     0     0
 3799 6a063284 ACT ACT ACT      45104      0    4    74   294     0     0
 3799 ec73b3b2 NEW NEW NEW      32284      0    3     0     0     0     0
 4675 7f857e2b ACT ACT ACT      46804      0    4    74   294     0     0
 4675 9ddf3ced NEW NEW NEW      32285      0    3     0     0     0     0
23599 0cfcf49a ACT ACT ACT       1000      0    3     0     0     0     0
23599 0d11352b NEW  WQ  WQ       1000      0    2     0     0     0     0
23599 1110b676 NEW NEW NEW       1000      0    0     0     0     0     0
23599 1a8e3979 NEW NEW NEW       1000      0    0     0     0     0     0
23599 1ff57192 NEW NEW NEW       1000      0    0     0     0     0     0
23599 2e3546f5 NEW NEW NEW       1000      0    0     0     0     0     0
23599 2f0400c0 NEW NEW NEW       1000      0    0     0     0     0     0
23599 421d69d5 NEW NEW NEW       1000      0    0     0     0     0     0
23599 52895e05 NEW NEW NEW       1000      0    0     0     0     0     0
23599 5bd29544 ACT ACT ACT       1000      0    3     0     0     0     0
23599 5daad607 NEW NEW NEW       1000      0    0     0     0     0     0
23599 71d5a05c NEW NEW NEW       1000      0    0     0     0     0     0
23599 78e6b938 ACT ACT ACT       1000      0    3     0     0     0     0
23599 87a4e2ef NEW NEW NEW       1000      0    0     0     0     0     0
23599 96d01ffd NEW NEW NEW       1000      0    0     0     0     0     0
23599 a05082d3 NEW NEW NEW       1000      0    0     0     0     0     0
23599 a963c947 NEW NEW NEW       1000      0    0     0     0     0     0
23599 b38bf997 NEW NEW NEW       1000      0    0     0     0     0     0
23599 b7249a0e NEW NEW NEW       1000      0    0     0     0     0     0
23599 b889ab8d NEW NEW NEW       1000      0    0     0     0     0     0
23599 c083ad7e NEW NEW NEW       1000      0    0     0     0     0     0
23599 c30a751d NEW NEW NEW       1000      0    0     0     0     0     0
23599 d62e6274 ACT ACT ACT       1000      0    3     0     0     0     0
23599 d8f92d97 NEW NEW NEW       1000      0    0     0     0     0     0
23599 d9554e89 NEW NEW NEW       1000      0    0     0     0     0     0
23599 db92d4bb ACT ACT ACT       1000      0    3     0     0     0     0
23599 df59bbbb NEW NEW NEW       1000      0    0     0     0     0     0
23599 edae771d NEW  WQ  WQ       1000      0    2     0     0     0     0
23599 ef30e826  WQ  WQ NEW       1000      0    2     0     0     0     0
23599 f2be7504 NEW NEW NEW       1000      0    0     0     0     0     0
23599 fcd9d3ae ACT ACT ACT       1000      0    3     0     0     0     0
23599 fe78d557 NEW NEW NEW       1000      0    0     0     0     0     0
27569 e8d0fb18 ACT ACT ACT      22703      0    3     0     0   296     0
29899 1013f595 ACT ACT ACT      43815      0    4    74   294     0     0
29899 3d815ada NEW NEW NEW      32225      0    3     0     0     0     0
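
Picking the WaitQuorum sessions out of a table this long by eye is error prone, so here is a rough filter. It assumes the twelve-column layout printed by get-up-state.sh above and that `WQ` is the abbreviation for WaitQuorum; `sessions_waiting_quorum` is a hypothetical name.

```python
def sessions_waiting_quorum(table):
    """Return (pid, session) pairs where any downstairs column reads WQ."""
    stuck = []
    for line in table.splitlines():
        cols = line.split()
        # Expect PID SESSION DS0 DS1 DS2 plus seven numeric columns.
        if len(cols) != 12 or not cols[0].isdigit():
            continue
        pid, session, downstairs = cols[0], cols[1], cols[2:5]
        if "WQ" in downstairs:
            stuck.append((pid, session))
    return stuck

# A few rows copied from the sled 14 output above.
sample = """\
  PID  SESSION DS0 DS1 DS2   NEXT_JOB  DELTA CONN   ELR   ELC   ERR   ERN
23599 0cfcf49a ACT ACT ACT       1000      0    3     0     0     0     0
23599 0d11352b NEW  WQ  WQ       1000      0    2     0     0     0     0
23599 ef30e826  WQ  WQ NEW       1000      0    2     0     0     0     0
"""
print(sessions_waiting_quorum(sample))
```

On the full table above this picks out three sessions, all in pantry PID 23599: 0d11352b, edae771d, and ef30e826.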

And, more info from 14:

root@oxz_switch1:~# pilot host exec -c 'hostname && /opt/oxide/crucible_dtrace/get-ds-state.sh' 14
14  BRM42220036        ok: BRM42220036
oxz_propolis-server_427c85cb  1009 3bfb134a ACT ACT ACT
oxz_propolis-server_427c85cb  1009 3bfb134a NEW NEW NEW
oxz_propolis-server_bb1efa43  1993 5b1977b5 ACT ACT ACT
oxz_propolis-server_bb1efa43  1993 f708df24 NEW NEW NEW
oxz_propolis-server_1fecd471  2775 7a340e99 ACT ACT ACT
oxz_propolis-server_1fecd471  2775 7a340e99 NEW NEW NEW
oxz_propolis-server_1fecd471  2775 a5e3b223 ACT ACT ACT
oxz_propolis-server_1fecd471  2775 a5e3b223 NEW NEW NEW
oxz_propolis-server_ef3db0cb  3799 6a063284 ACT ACT ACT
oxz_propolis-server_ef3db0cb  3799 6a063284 NEW NEW NEW
oxz_propolis-server_cb99502b  4675 7f857e2b ACT ACT ACT
oxz_propolis-server_cb99502b  4675 9ddf3ced NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 0cfcf49a ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 0d11352b NEW  WQ  WQ
oxz_crucible_pantry_47fa71bb 23599 1110b676 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 1a8e3979 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 1ff57192 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 2e3546f5 NEW  WQ  WQ
oxz_crucible_pantry_47fa71bb 23599 2e3546f5 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 2f0400c0 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 421d69d5 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 52895e05 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 5bd29544 ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 5daad607 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 71d5a05c NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 78e6b938 ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 87a4e2ef NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 96d01ffd NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 a05082d3 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 a963c947 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 b38bf997 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 b7249a0e NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 b889ab8d NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 c083ad7e NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 c30a751d NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 d62e6274 ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 d8f92d97 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 d9554e89 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 df59bbbb NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 ef30e826  WQ  WQ NEW
oxz_crucible_pantry_47fa71bb 23599 f2be7504 ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 f2be7504 NEW NEW NEW
oxz_crucible_pantry_47fa71bb 23599 fcd9d3ae ACT ACT ACT
oxz_crucible_pantry_47fa71bb 23599 fe78d557 NEW NEW NEW
oxz_propolis-server_aa745070 27569 e8d0fb18 ACT ACT ACT
oxz_propolis-server_b857410c 29899 1013f595 ACT ACT ACT
oxz_propolis-server_b857410c 29899 3d815ada NEW NEW NEW
@leftwo
Contributor Author

leftwo commented Dec 24, 2024

Some pantry logs for 14 at: /staff/core/crucible-1591/BRM42220036

@leftwo leftwo changed the title from "Sled expunge on" to "Sled expunge on london resulted in unfinished downstairs replacements." Dec 24, 2024
@leftwo
Contributor Author

leftwo commented Dec 27, 2024

Starting at the top replacement on the list

root@oxz_switch1:~# omdb  db region-replacement info 06cd17c3-1308-4772-9c41-52eed351364f 2> /dev/null
      started: 2024-12-23 20:49:15.015593 UTC
        state: Driving
old region id: 30a96ee9-b562-40a8-ab9c-57fca0f84809
new region id: Some(452b7095-312e-4d1a-9285-4ef344d09df4)

Repair notifications
 REPAIR_ID REPAIR_TYPE UPSTAIRS_ID SESSION_ID NOTIFICATION_TYPE


Repair steps
TIME                     STEP_TYPE DETAILS
2024-12-23T20:50:17.920Z Pantry    address Some(fd00:1122:3344:101::7):Some(SqlU16(17000)) job Some(33f5ded5-1164-47fa-96be-903653809e8f)

So, it's on a pantry, but which sled? To find it, we look for a sled whose underlay address shares the fd00:1122:3344:101:: prefix:

root@oxz_switch1:~# pilot host exec -c 'ipadm | grep 1122:3344 | grep sled6' 14 15 16
14  BRM42220036        ok: underlay0/sled6   static   ok           fd00:1122:3344:101::1/64
15  BRM42220062        ok: underlay0/sled6   static   ok           fd00:1122:3344:103::1/64
16  BRM42220030        ok: underlay0/sled6   static   ok           fd00:1122:3344:102::1/64

Sled 14, great.
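
The same prefix match can be done mechanically. A minimal sketch using Python's standard `ipaddress` module, assuming the /64 underlay addresses from the ipadm output above; `sled_for_address` and the `sleds` mapping are hypothetical names for illustration.

```python
import ipaddress

def sled_for_address(addr, sled_underlay):
    """Return the sled whose underlay network contains `addr`, or None."""
    ip = ipaddress.ip_address(addr)
    for sled, iface in sled_underlay.items():
        # ip_interface accepts a host address with a prefix length,
        # and .network gives the enclosing /64.
        if ip in ipaddress.ip_interface(iface).network:
            return sled
    return None

# Underlay addresses copied from the ipadm output above.
sleds = {
    "14": "fd00:1122:3344:101::1/64",
    "15": "fd00:1122:3344:103::1/64",
    "16": "fd00:1122:3344:102::1/64",
}
print(sled_for_address("fd00:1122:3344:101::7", sleds))
```

The same helper works for downstairs addresses, for example to see which sled a region at fd00:1122:3344:102::e lives on.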
Looking at the pantry logs from sled 14, I found a line with the new region ID from the region-repair output. I can use that region ID to find the session ID for that upstairs pantry.

{"msg":"downstairs client at Some([fd00:1122:3344:102::e]:19007) has region UUID 452b7095-312e-4d1a-9285-4ef344d09df4","v":0,"name":"crucible-pantry","level":30,"time":"2024-12-23T20:50:18.591572497Z","hostname":"oxz_crucible_pantry_47fa71bb-12db-4d5e-83a2-0e99624beec6","pid":23599,"client":"0","":"downstairs","session_id":"5bd29544-5b03-455e-b614-b5e3901afc2a","component":"datafile"}
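
Pulling the session ID out of a log line like the one above can be scripted. This sketch assumes the pantry log is one JSON object per line and that the message text has the "has region UUID ..." form shown; `region_to_session` is a hypothetical helper.

```python
import json
import re

REGION_RE = re.compile(r"has region UUID ([0-9a-f-]{36})")

def region_to_session(log_lines):
    """Map region UUID -> session_id for every matching JSON log line."""
    mapping = {}
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not JSON
        if not isinstance(rec, dict):
            continue
        match = REGION_RE.search(rec.get("msg", ""))
        if match and "session_id" in rec:
            mapping[match.group(1)] = rec["session_id"]
    return mapping

# The log line above, trimmed to the fields used here.
line = (
    '{"msg":"downstairs client at Some([fd00:1122:3344:102::e]:19007) '
    'has region UUID 452b7095-312e-4d1a-9285-4ef344d09df4",'
    '"pid":23599,"client":"0","":"downstairs",'
    '"session_id":"5bd29544-5b03-455e-b614-b5e3901afc2a"}'
)
print(region_to_session([line]))
```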

So, region 452b7095 is connected to session 5bd29544. I can use this later to find the whole upstairs log for this session.
That crucible UUID is also the disk UUID, so I can now look up the disk in omdb:

root@oxz_switch1:~# omdb db disks list 2> /dev/null | egrep "ID|bef4889d"
ID                                   SIZE    STATE    NAME        ATTACHED_TO
bef4889d-e590-458e-81ec-b997e18f3760 200 GiB detached disk-20     -

And the disk info, which goes deep (this disk is based off of a 35-deep snapshot tree):

root@oxz_switch1:~# omdb db disks info bef4889d-e590-458e-81ec-b997e18f3760
HOST_SERIAL DISK_NAME INSTANCE_NAME PROPOLIS_ZONE VOLUME_ID                            DISK_STATE
-           disk-20   -             -             c9fd9286-4528-4212-be60-070178b3e093 detached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220062 f153ee6c-910d-4628-8350-ba9022bb5158 oxz_crucible_31942354-90cc-4f03-86c8-31b5c8ab33a7 591b6235-01c7-41bc-ba37-03d3c1b1e9dc
BRM42220030 452b7095-312e-4d1a-9285-4ef344d09df4 oxz_crucible_3b82acd4-61e0-421b-921d-44b042254c69 544ed702-38fe-4314-b7c5-8de94cc19304
BRM42220030 ccc72284-7f4e-4ae6-acd3-3edbef635fff oxz_crucible_ddb565c5-7cdc-406c-9abc-b396bb5effd3 e2f72669-bb24-4f3d-bc1a-cebc7f9d2aad

VCR from volume ID c9fd9286-4528-4212-be60-070178b3e093
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
bef4889d-e590-458e-81ec-b997e18f3760 512 1           true

SUB VOLUME 0
    ID                                   BS  BPE    EC   GEN READ_ONLY
    bef4889d-e590-458e-81ec-b997e18f3760 512 131072 3200 5   false
    [fd00:1122:3344:102::e]:19007
    [fd00:1122:3344:102::8]:19006
    [fd00:1122:3344:103::7]:19003

READ ONLY PARENT:
    ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
    fef63546-9a7b-426b-ab2c-e72f6a3283ca 512 1           true

    SUB VOLUME 0
        ID                                   BS  BPE    EC   GEN READ_ONLY
        fd7b68c1-268f-4510-8d28-def8c6fddc95 512 131072 3200 2   true
        [fd00:1122:3344:101::b]:19005
        [fd00:1122:3344:102::d]:19005
        [fd00:1122:3344:121::24]:19008
        
    READ ONLY PARENT:
        ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
        67fe02af-af39-413e-9c60-d6d3af08d66d 512 1           true
        
        SUB VOLUME 0
            ID                                   BS  BPE    EC   GEN READ_ONLY
            39bbc2e1-f5eb-440d-9815-962fafe1fabf 512 131072 3200 2   true
            [fd00:1122:3344:121::28]:19005
            [fd00:1122:3344:101::a]:19012
            [fd00:1122:3344:102::d]:19003
   ... continues, but cut for space

The pantry logs from sled 14 for just this session (5bd29544) show this pantry was able to reconcile all three downstairs and that everything should be done (logs for this session are at pantry-log-just-session-5bd29544-from-1734988498):

2024-12-23 20:50:22.919Z INFO crucible-pantry/23599 (datafile) on oxz_crucible_pantry_47fa71bb-12db-4d5e-83a2-0e99624beec6: All required reconciliation work is completed
2024-12-23 20:50:22.919Z INFO crucible-pantry/23599 (datafile) on oxz_crucible_pantry_47fa71bb-12db-4d5e-83a2-0e99624beec6: Set Downstairs and Upstairs active after reconciliation
2024-12-23 20:50:22.919Z INFO crucible-pantry/23599 (datafile) on oxz_crucible_pantry_47fa71bb-12db-4d5e-83a2-0e99624beec6: bef4889d-e590-458e-81ec-b997e18f3760 is now active with session: 5bd29544-5b03-455e-b614-b5e3901afc2a
2024-12-23 20:50:22.919Z INFO crucible-pantry/23599 (datafile) on oxz_crucible_pantry_47fa71bb-12db-4d5e-83a2-0e99624beec6: Set Active after no reconciliation

So, for this first region replacement (06cd17c3) item, the pantry does think it's done.
Looking on sled 14, I see this pantry session has downstairs states that all match:

BRM42220036 # dtrace -s /opt/oxide/crucible_dtrace/sled_upstairs_info.d | grep 5bd29544
23599 5bd29544            active            active            active     0     0      1000 

That the three downstairs are all still active tells me that either the pantry has not told Nexus that the "repair" is done, or that nexus has not asked to confirm that the pantry has completed.

@leftwo
Contributor Author

leftwo commented Dec 27, 2024

Looking at omdb db sagas output, and looking for sagas that reference 06cd17c3-1308-4772-9c41-52eed351364f
I can see the following.
(logs at omdb-saga-output-region-replacement)
Note that the fetch limit would return memory errors if I went above 17000, so possibly not all saga output was gathered, though these times do line up with what the pantry logs show.

From id 7c3c085b-7928-490a-94ca-20e879bfa9e0
 id                                   | time_created                   | name                     | state                 
--------------------------------------+--------------------------------+--------------------------+-----------------------
 7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.012561 UTC | region-replacement-start | SagaCachedState(Done) 

                             saga id | event time                     | node id                                   | event type    | data 
------------------------------------ | ------------------------------ | ----------------------------------------- | ------------- | ---
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.016631 UTC |  12: start                                | started       | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.019932 UTC |  12: start                                | succeeded     | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.023532 UTC |   0: common.uuid_generate                 | started       |    
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.027644 UTC |   0: common.uuid_generate                 | succeeded     | "bd5db001-b4f8-47a1-8e19-e9ff4386584c"
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.032449 UTC |   1: common.uuid_generate                 | started       |    
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.036413 UTC |   1: common.uuid_generate                 | succeeded     | "f2d16438-df3a-4fe0-ac04-751d78081688"
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.039561 UTC |   2: region_replacement_start.set_saga_id | started       | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.088642 UTC |   2: region_replacement_start.set_saga_id | failed        | {"ActionFailed":{"source_error":{"Conflict":{"message":{"external_message":"region replacement 06cd17c3-1308-4772-9c41-52eed351364f set to Allocating (operating saga id Some(16e518cd-d0fe-4d8c-9ca3-df94829a2b25))","internal_context":""}}}}}
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.098796 UTC |   1: common.uuid_generate                 | undo_started  |    
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.103090 UTC |   1: common.uuid_generate                 | undo_finished |    
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.107068 UTC |   0: common.uuid_generate                 | undo_started  | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.111779 UTC |   0: common.uuid_generate                 | undo_finished | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.114916 UTC |  12: start                                | undo_started  | 
7c3c085b-7928-490a-94ca-20e879bfa9e0 | 2024-12-23 20:49:16.117723 UTC |  12: start                                | undo_finished |  

Then, another like the one above, then this one that looks a little different:

From id 6ce9262a-dfbc-4913-8ad4-33750436e7a3
 id                                   | time_created                   | name                     | state                 
--------------------------------------+--------------------------------+--------------------------+-----------------------
 6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.726296 UTC | region-replacement-drive | SagaCachedState(Done) 
                             
                             saga id | event time                     | node id                                                        | event type | data
------------------------------------ | ------------------------------ | -------------------------------------------------------------- | ---------- | ---
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.734329 UTC |   8: start                                                     | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.740007 UTC |   8: start                                                     | succeeded  | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.855454 UTC |   0: common.uuid_generate                                      | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.859780 UTC |   0: common.uuid_generate                                      | succeeded  | "f2915112-f9f3-4e77-9910-adb0ba135518"
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.864585 UTC |   1: common.uuid_generate                                      | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.869536 UTC |   1: common.uuid_generate                                      | succeeded  | "33f5ded5-1164-47fa-96be-903653809e8f"
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.873994 UTC |   2: region_replacement_drive.set_saga_id                      | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.886260 UTC |   2: region_replacement_drive.set_saga_id                      | succeeded  | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.891085 UTC |   3: region_replacement_drive.drive_region_replacement_check   | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.903671 UTC |   3: region_replacement_drive.drive_region_replacement_check   | succeeded  | "ActionRequired"
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.907883 UTC |   4: region_replacement_drive.drive_region_replacement_prepare | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.920239 UTC |   4: region_replacement_drive.drive_region_replacement_prepare | succeeded  | {"Pantry":{"step":{"replacement_id":"06cd17c3-1308-4772-9c41-52eed351364f","step_associated_instance_id":null,"step_associated_pantry_ip":"fd00:1122:3344:101::7","step_associated_pantry_job_id":"33f5ded5-1164-47fa-96be-903653809e8f","step_associated_pantry_port":17000,"step_associated_vmm_id":null,"step_time":"2024-12-23T20:50:17.920167235Z","step_type":"Pantry"},"volume_id":"c9fd9286-4528-4212-be60-070178b3e093"}}
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:17.924432 UTC |   5: region_replacement_drive.drive_region_replacement_execute | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.592113 UTC |   5: region_replacement_drive.drive_region_replacement_execute | succeeded  | {"replacement_done":false,"step_to_commit":{"replacement_id":"06cd17c3-1308-4772-9c41-52eed351364f","step_associated_instance_id":null,"step_associated_pantry_ip":"fd00:1122:3344:101::7","step_associated_pantry_job_id":"33f5ded5-1164-47fa-96be-903653809e8f","step_associated_pantry_port":17000,"step_associated_vmm_id":null,"step_time":"2024-12-23T20:50:17.920167235Z","step_type":"Pantry"}}
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.595879 UTC |   6: region_replacement_drive.drive_region_replacement_commit  | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.642488 UTC |   6: region_replacement_drive.drive_region_replacement_commit  | succeeded  | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.646453 UTC |   7: region_replacement_drive.finish_saga                      | started    |  
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.658188 UTC |   7: region_replacement_drive.finish_saga                      | succeeded  |  
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.661107 UTC |   9: end                                                       | started    | 
6ce9262a-dfbc-4913-8ad4-33750436e7a3 | 2024-12-23 20:50:18.665064 UTC |   9: end                                                       | succeeded  | 

Then another action failed:

From id eb392f73-1035-4eca-b1fd-f08b456eb793
 id                                   | time_created                   | name                     | state                               
--------------------------------------+--------------------------------+--------------------------+-----------------------
 eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.756650 UTC | region-replacement-drive | SagaCachedState(Done) 

                             saga id | event time                     | node id                                   | event type    | data
------------------------------------ | ------------------------------ | ----------------------------------------- | ------------- | --- 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.902902 UTC |   8: start                                | started       | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.907156 UTC |   8: start                                | succeeded     | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.911137 UTC |   0: common.uuid_generate                 | started       | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.915695 UTC |   0: common.uuid_generate                 | succeeded     | "07b8ec6c-04f5-4603-b707-ae0fcbba17fb"
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.919476 UTC |   1: common.uuid_generate                 | started       | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.923297 UTC |   1: common.uuid_generate                 | succeeded     | "f8272993-539d-43d6-bb97-664bcfdf9f27"
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.929226 UTC |   2: region_replacement_drive.set_saga_id | started       | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.936473 UTC |   2: region_replacement_drive.set_saga_id | failed        | {"ActionFailed":{"source_error":{"Conflict":{"message":{"external_message":"region replacement 06cd17c3-1308-4772-9c41-52eed351364f set to Driving (operating saga id Some(f2915112-f9f3-4e77-9910-adb0ba135518))","internal_context":""}}}}}
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.946484 UTC |   1: common.uuid_generate                 | undo_started  | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.949533 UTC |   1: common.uuid_generate                 | undo_finished | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.953472 UTC |   0: common.uuid_generate                 | undo_started  | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.956671 UTC |   0: common.uuid_generate                 | undo_finished | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.959720 UTC |   8: start                                | undo_started  | 
eb392f73-1035-4eca-b1fd-f08b456eb793 | 2024-12-23 20:50:17.963697 UTC |   8: start                                | undo_finished | 

Then, a bunch more of those, which appear to repeat every 10 seconds. The last one I have in the log is this one, which still looks the same as the others, with that same ActionFailed kind of message:

From id c4d5344e-d811-47b1-bf69-58ad6477ad5d
 id                                   | time_created                   | name                     | state                              
--------------------------------------+--------------------------------+--------------------------+-----------------------             
 c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.937905 UTC | region-replacement-drive | SagaCachedState(Done)              

                             saga id | event time                     | node id                                   | event type    | data 
------------------------------------ | ------------------------------ | ----------------------------------------- | ------------- | ---
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.943977 UTC |   8: start                                | started       |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.947264 UTC |   8: start                                | succeeded     |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.949597 UTC |   0: common.uuid_generate                 | started       |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.952867 UTC |   0: common.uuid_generate                 | succeeded     | "7f00bb5b-2514-4e91-9534-2252822e7da7"
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.955803 UTC |   1: common.uuid_generate                 | started       | 
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.959183 UTC |   1: common.uuid_generate                 | succeeded     | "6de75735-6193-4e31-adba-95e53aaf28c4"
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.962123 UTC |   2: region_replacement_drive.set_saga_id | started       | 
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.970262 UTC |   2: region_replacement_drive.set_saga_id | failed        | {"ActionFailed":{"source_error":{"Conflict":{"message":{"external_message":"region replacement 06cd17c3-1308-4772-9c41-52eed351364f set to Driving (operating saga id Some(5e5e2352-0df4-47cf-96c5-e7d3c02deca5))","internal_context":""}}}}}
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.979392 UTC |   1: common.uuid_generate                 | undo_started  | 
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.982433 UTC |   1: common.uuid_generate                 | undo_finished |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.986091 UTC |   0: common.uuid_generate                 | undo_started  |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.989056 UTC |   0: common.uuid_generate                 | undo_finished |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.992060 UTC |   8: start                                | undo_started  |    
c4d5344e-d811-47b1-bf69-58ad6477ad5d | 2024-12-23 22:12:47.994964 UTC |   8: start                                | undo_finished | 
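
To compare these failures across the whole saga dump, the `data` column of each failed `set_saga_id` event can be parsed for the saga ID that currently holds the replacement. A sketch only: it assumes the ActionFailed/Conflict JSON shape and message wording seen in the events above, and `conflict_holder` is a hypothetical name.

```python
import json
import re

CONFLICT_RE = re.compile(
    r"region replacement (?P<replacement>[0-9a-f-]{36}) set to (?P<state>\w+) "
    r"\(operating saga id Some\((?P<holder>[0-9a-f-]{36})\)\)"
)

def conflict_holder(data_json):
    """Return replacement, state, and holder for a Conflict failure, else None."""
    data = json.loads(data_json)
    try:
        msg = data["ActionFailed"]["source_error"]["Conflict"]["message"]["external_message"]
    except (KeyError, TypeError):
        return None  # some other event payload, e.g. a bare UUID string
    match = CONFLICT_RE.search(msg)
    return match.groupdict() if match else None

# The data column from the failed set_saga_id event of saga eb392f73 above.
data = (
    '{"ActionFailed":{"source_error":{"Conflict":{"message":{"external_message":'
    '"region replacement 06cd17c3-1308-4772-9c41-52eed351364f set to Driving '
    '(operating saga id Some(f2915112-f9f3-4e77-9910-adb0ba135518))",'
    '"internal_context":""}}}}}'
)
print(conflict_holder(data))
```

In the two drive failures quoted above, the holder is not the same: eb392f73 lost to f2915112 (the UUID generated by saga 6ce9262a, which succeeded), while c4d5344e lost to 5e5e2352. Grouping all failures by holder would show how often the holder changes over time.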

@askfongjojo askfongjojo added this to the 13 milestone Jan 3, 2025
@askfongjojo

askfongjojo commented Jan 4, 2025

I tried expunging disks one at a time while also checking which omicron services had their zone datasets on the disk being expunged and ran blueprint updates (i.e., regenerate, target set, etc) after every disk expunge. I hit upon a case of unfinished region replacement when the physical disk hosted the nexus dataset and the particular nexus was handling the region replacement saga at the time it was expunged. I don't know if this would explain why the replacement couldn't finish. Here are some key events and related data:

These are the region replacement jobs triggered by the disk expungement - one of them failed to complete:

ID                                   REQUEST_TIME             REPLACEMENT_STATE 
22289c13-618a-47b1-be30-bb3d35cb99f9 2025-01-03T20:50:01.749Z Complete          
817ccdfb-a811-4606-95a1-8e9285dc1f38 2025-01-03T20:50:01.876Z Complete          
97b4b3f5-6bbd-4d32-869d-22a9965fd5e7 2025-01-03T20:50:01.828Z Complete          
b662c8a1-2224-440f-ba08-b95f1f2c6068 2025-01-03T20:50:01.480Z Complete          
bb0ffc32-e8ad-4053-864e-d1e889717f16 2025-01-03T20:50:01.457Z Complete          
c031de90-517a-4483-a748-9e05643dfadf 2025-01-03T20:50:02.611Z Running           
f9504da0-2641-4726-bc62-add84e1c5fae 2025-01-03T20:50:01.925Z Complete
root@[fd00:1122:3344:102::3]:32221/omicron> select * from region_replacement_step where replacement_id = 'c031de90-517a-4483-a748-9e05643dfadf';
             replacement_id            |           step_time           | step_type | step_associated_instance_id | step_associated_vmm_id | step_associated_pantry_ip | step_associated_pantry_port |    step_associated_pantry_job_id
---------------------------------------+-------------------------------+-----------+-----------------------------+------------------------+---------------------------+-----------------------------+---------------------------------------
  c031de90-517a-4483-a748-9e05643dfadf | 2025-01-03 20:50:06.604489+00 | pantry    | NULL                        | NULL                   | fd00:1122:3344:103::5     |                       17000 | 7d4d02f2-3793-4975-bb60-f86277faf3e6

The associated pantry started returning connection timeout errors at 20:50:18.075Z:

20:50:18.075Z WARN crucible-pantry (datafile): connect timeout
     = downstairs
    client = 0
    session_id = ee2d6773-c7a5-4dc0-880d-73e965a45011
20:50:18.075Z WARN crucible-pantry (datafile): client task is sending Done(ConnectionTimeout)
     = downstairs
    client = 0
    session_id = ee2d6773-c7a5-4dc0-880d-73e965a45011

That is around the time when the nexus zone was being expunged:

20:50:17.709Z DEBG e5ff372c-beaa-4f7c-b20c-f8af4d7957ee (ServerContext): client response
    SledAgent = f95d1ee6-4601-4ff2-937b-5fc67dcb4fd0
    background_task = blueprint_executor
    comment = sled d026352e-077d-4f6e-a31c-4f0b18b6d1ac: expunged 1 zones because: zone using expunged disk, sled d026352e-077d-4f6e-a31c-4f0b18b6d1ac: added 0 disks, updated 0, removed 0 disks
    result = Ok(Response { url: "http://[fd00:1122:3344:103::1]:12345/vpc/001de000-074c-4000-8000-000000000000/firewall/rules", status: 204, headers: {"x-request-id": "1186387a-731b-4ad4-aa5e-df28159fdb34", "date": "Fri, 03 Jan 2025 20:50:16 GMT"} })

The last few sagas processed by the reaped nexus zone were mostly for replacement ID c031de90-517a-4483-a748-9e05643dfadf:

root@[fd00:1122:3344:102::3]:32221/omicron> select * from saga where current_sec = 'e5ff372c-beaa-4f7c-b20c-f8af4d7957ee' order by time_created desc limit 20;
                   id                  |               creator                |         time_created          |           name            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saga_dag                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | saga_state |             current_sec              | adopt_generation |          adopt_time
---------------------------------------+--------------------------------------+-------------------------------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------+------------------+--------------------------------
  509d1e77-b8b6-4ab4-94cb-6456603166b1 | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:51:23.686426+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "c031de90-517a-4483-a748-9e05643dfadf", "new_region_id": "a0cc6041-0f3c-4317-a4fb-66aa54756147", "old_region_id": "ef11f480-7e81-44f0-bfd1-7c74e4044d00", "old_region_volume_id": "3ff62c06-7841-4123-a471-b46766cc1144", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:02.611374Z", "volume_id": "bf37f97d-e041-4679-a5b2-cf766949a88d"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:51:23.686426+00
  d727d13b-524d-49e3-aa5c-edef4c2d46bc | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:51:13.719012+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "c031de90-517a-4483-a748-9e05643dfadf", "new_region_id": "a0cc6041-0f3c-4317-a4fb-66aa54756147", "old_region_id": "ef11f480-7e81-44f0-bfd1-7c74e4044d00", "old_region_volume_id": "3ff62c06-7841-4123-a471-b46766cc1144", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:02.611374Z", "volume_id": "bf37f97d-e041-4679-a5b2-cf766949a88d"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:51:13.719012+00
  ff1d3ef8-d22b-43d3-aa6c-eed209e056ec | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:51:03.60847+00  | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "c031de90-517a-4483-a748-9e05643dfadf", "new_region_id": "a0cc6041-0f3c-4317-a4fb-66aa54756147", "old_region_id": "ef11f480-7e81-44f0-bfd1-7c74e4044d00", "old_region_volume_id": "3ff62c06-7841-4123-a471-b46766cc1144", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:02.611374Z", "volume_id": "bf37f97d-e041-4679-a5b2-cf766949a88d"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:51:03.60847+00
  03e59f64-2a38-4cba-b173-b72720c349e0 | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:51:03.602397+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "817ccdfb-a811-4606-95a1-8e9285dc1f38", "new_region_id": "a928c178-6ee2-4cde-b432-5a6f23460862", "old_region_id": "e637a68b-6afb-4db8-809a-82be7380eeec", "old_region_volume_id": "f7430b3b-b640-4faf-bb5d-bd7520e02d46", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:01.876097Z", "volume_id": "10ce4b87-8f2d-44f9-a084-e5846983c20a"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:51:03.602397+00
  b761b0a4-d62a-403e-a178-95cffff518cd | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:50:53.705703+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "c031de90-517a-4483-a748-9e05643dfadf", "new_region_id": "a0cc6041-0f3c-4317-a4fb-66aa54756147", "old_region_id": "ef11f480-7e81-44f0-bfd1-7c74e4044d00", "old_region_volume_id": "3ff62c06-7841-4123-a471-b46766cc1144", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:02.611374Z", "volume_id": "bf37f97d-e041-4679-a5b2-cf766949a88d"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:50:53.705703+00
  951cc2f2-e2b3-4db7-92c7-59c74f81192c | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:50:53.667261+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "817ccdfb-a811-4606-95a1-8e9285dc1f38", "new_region_id": "a928c178-6ee2-4cde-b432-5a6f23460862", "old_region_id": "e637a68b-6afb-4db8-809a-82be7380eeec", "old_region_volume_id": "f7430b3b-b640-4faf-bb5d-bd7520e02d46", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:01.876097Z", "volume_id": "10ce4b87-8f2d-44f9-a084-e5846983c20a"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | done       | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee |                1 | 2025-01-03 20:50:53.667261+00
  9e50eaa3-4eed-469f-a038-f6bef4087cb3 | e5ff372c-beaa-4f7c-b20c-f8af4d7957ee | 2025-01-03 20:50:44.592674+00 | region-replacement-drive  | {"end_node": 9, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [3, 4, null], [4, 5, null], [5, 6, null], [6, 7, null], [8, 0, null], [7, 9, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "common.uuid_generate", "label": "GenerateSagaId", "name": "saga_id"}}, {"Action": {"action_name": "common.uuid_generate", "label": "GenerateJobId", "name": "job_id"}}, {"Action": {"action_name": "region_replacement_drive.set_saga_id", "label": "SetSagaId", "name": "unused_1"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_check", "label": "DriveRegionReplacementCheck", "name": "check"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_prepare", "label": "DriveRegionReplacementPrepare", "name": "prepare"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_execute", "label": "DriveRegionReplacementExecute", "name": "execute"}}, {"Action": {"action_name": "region_replacement_drive.drive_region_replacement_commit", "label": "DriveRegionReplacementCommit", "name": "commit"}}, {"Action": {"action_name": "region_replacement_drive.finish_saga", "label": "FinishSaga", "name": "unused_2"}}, {"Start": {"params": {"request": {"id": "c031de90-517a-4483-a748-9e05643dfadf", "new_region_id": "a0cc6041-0f3c-4317-a4fb-66aa54756147", "old_region_id": "ef11f480-7e81-44f0-bfd1-7c74e4044d00", "old_region_volume_id": "3ff62c06-7841-4123-a471-b46766cc1144", "operating_saga_id": null, "replacement_state": "Running", "request_time": "2025-01-03T20:50:02.611374Z", "volume_id": "bf37f97d-e041-4679-a5b2-cf766949a88d"}, "serialized_authn": {"kind": {"Authenticated": [{"actor": {"UserBuiltin": {"user_builtin_id": "001de000-05e4-4000-8000-000000000002"}}}, null]}}}}}, "End"]}, "saga_name": "region-replacement-drive", 
"start_node": 8}                                                                                                                                                                                                                                                                                                                                                        

After reconfigurator replaced the expunged nexus with a new nexus instance, the pantry did not recover from the connection timeout error. I tried bouncing the pantry SMF service, but that did not eliminate the error. I wonder if the timeout is due to a step that is stuck indefinitely because nexus went away.

@askfongjojo

The nexus dependency is a red herring or a different issue. I retried sled expungement on london with a sled that has neither a nexus nor a pantry zone, but still got some region and snapshot replacement jobs stuck in running or allocating states. So far, these problem replacements are all related to detached disks.

@leftwo
Contributor Author

leftwo commented Jan 6, 2025

> The nexus dependency is a red herring or a different issue. I retried sled expungement on london with a sled that has neither a nexus nor a pantry zone, but still got some region and snapshot replacement jobs stuck in running or allocating states. So far, these problem replacements are all related to detached disks.

Yeah, I did not mention it in my initial reporting, but I had first added a sled, then expunged it, so it only had Crucible and NTP zones.

@leftwo
Contributor Author

leftwo commented Jan 7, 2025

So, the above disk info points to one issue here that will prevent repair from making forward progress.
When a disk contains a downstairs that needs repair, and that disk is not attached to a running propolis, we spin up an upstairs in the pantry and send that upstairs the VCR with the "fix" in it.
We then rely on the upstairs to repair the new downstairs during the initial reconciliation that happens on activation at startup.
This process is fine when there is just one bad downstairs in a VCR: once the pantry upstairs has completed activation, we can tell nexus that all is good and the disk is repaired.

However, we don't have a simple disk in this case. We have a VCR with a RW sub-volume and a tree of read-only parents.
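
To make the shape of that layered VCR concrete, here is a minimal Rust sketch. Everything in it is an assumption for illustration: the `Layer` type, the `targets_needing_replacement` helper, and the addresses are hypothetical and are not the real Crucible or Nexus types. It only shows that, in a volume with a chain of read-only parents, the downstairs that need replacing can sit in any layer, not just in the RW sub-volume.

```rust
// Simplified, hypothetical stand-in for a VCR. This is NOT Crucible's real
// type; it only mirrors the layering visible in the omdb output.
#[derive(Debug)]
struct Layer {
    /// Downstairs targets ("[addr]:port") of this layer's single sub-volume.
    targets: Vec<String>,
    /// False for the read-write layer, true for read-only parent layers.
    read_only: bool,
    /// The next layer down the chain of read-only parents, if any.
    read_only_parent: Option<Box<Layer>>,
}

/// Walk from the top layer (depth 0) down through every read-only parent and
/// return (depth, read_only, target) for each target the predicate flags.
fn targets_needing_replacement<F>(top: &Layer, needs_replacement: F) -> Vec<(usize, bool, String)>
where
    F: Fn(&str) -> bool,
{
    let mut found = Vec::new();
    let mut current = Some(top);
    let mut depth = 0;
    while let Some(layer) = current {
        for target in &layer.targets {
            if needs_replacement(target.as_str()) {
                found.push((depth, layer.read_only, target.clone()));
            }
        }
        current = layer.read_only_parent.as_deref();
        depth += 1;
    }
    found
}

/// Hypothetical three-layer volume with made-up addresses. The "99" subnet
/// stands in for an expunged sled.
fn example_volume() -> Layer {
    let grandparent = Layer {
        targets: vec![
            "[fd00:0:0:99::c]:19002".to_string(),
            "[fd00:0:0:1::c]:19002".to_string(),
            "[fd00:0:0:2::c]:19002".to_string(),
        ],
        read_only: true,
        read_only_parent: None,
    };
    let parent = Layer {
        targets: vec![
            "[fd00:0:0:1::b]:19001".to_string(),
            "[fd00:0:0:2::b]:19001".to_string(),
            "[fd00:0:0:99::b]:19001".to_string(),
        ],
        read_only: true,
        read_only_parent: Some(Box::new(grandparent)),
    };
    Layer {
        targets: vec![
            "[fd00:0:0:1::a]:19000".to_string(),
            "[fd00:0:0:2::a]:19000".to_string(),
            "[fd00:0:0:3::a]:19000".to_string(),
        ],
        read_only: false,
        read_only_parent: Some(Box::new(parent)),
    }
}
```

Calling `targets_needing_replacement(&example_volume(), |t| t.starts_with("[fd00:0:0:99:"))` on this made-up volume flags one target in each of the two read-only layers and none in the RW layer, which is the kind of volume that has more than one bad downstairs at once.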

From above:

root@oxz_switch1:~# omdb db disks info bef4889d-e590-458e-81ec-b997e18f3760
HOST_SERIAL DISK_NAME INSTANCE_NAME PROPOLIS_ZONE VOLUME_ID                            DISK_STATE
-           disk-20   -             -             c9fd9286-4528-4212-be60-070178b3e093 detached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220062 f153ee6c-910d-4628-8350-ba9022bb5158 oxz_crucible_31942354-90cc-4f03-86c8-31b5c8ab33a7 591b6235-01c7-41bc-ba37-03d3c1b1e9dc
BRM42220030 452b7095-312e-4d1a-9285-4ef344d09df4 oxz_crucible_3b82acd4-61e0-421b-921d-44b042254c69 544ed702-38fe-4314-b7c5-8de94cc19304
BRM42220030 ccc72284-7f4e-4ae6-acd3-3edbef635fff oxz_crucible_ddb565c5-7cdc-406c-9abc-b396bb5effd3 e2f72669-bb24-4f3d-bc1a-cebc7f9d2aad

VCR from volume ID c9fd9286-4528-4212-be60-070178b3e093
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
bef4889d-e590-458e-81ec-b997e18f3760 512 1           true

SUB VOLUME 0
    ID                                   BS  BPE    EC   GEN READ_ONLY
    bef4889d-e590-458e-81ec-b997e18f3760 512 131072 3200 5   false
    [fd00:1122:3344:102::e]:19007
    [fd00:1122:3344:102::8]:19006
    [fd00:1122:3344:103::7]:19003

READ ONLY PARENT:
    ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
    fef63546-9a7b-426b-ab2c-e72f6a3283ca 512 1           true

    SUB VOLUME 0
        ID                                   BS  BPE    EC   GEN READ_ONLY
        fd7b68c1-268f-4510-8d28-def8c6fddc95 512 131072 3200 2   true
        [fd00:1122:3344:101::b]:19005
        [fd00:1122:3344:102::d]:19005
        [fd00:1122:3344:121::24]:19008
        
    READ ONLY PARENT:
        ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
        67fe02af-af39-413e-9c60-d6d3af08d66d 512 1           true
        
        SUB VOLUME 0
            ID                                   BS  BPE    EC   GEN READ_ONLY
            39bbc2e1-f5eb-440d-9815-962fafe1fabf 512 131072 3200 2   true
            [fd00:1122:3344:121::28]:19005
            [fd00:1122:3344:101::a]:19012
            [fd00:1122:3344:102::d]:19003

The way the pantry (or propolis) handles a multi-level VCR is that it spins up one upstairs instance for each sub-volume (of which there is currently only ever one) and one upstairs for each level of the read only parent. For the pantry to consider a repair complete, all the upstairs instances have to make it through activation. The VCR received by the pantry does not say which sub-volume or read only parent level is the one we want to repair, so the pantry activates (or tries to activate) all of them.
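To make the layering concrete, here is a rough Python model of the VCR for disk-20 shown above. This is only a sketch: the class names, field names, `upstairs_layers`, and `repair_complete` are all made up for illustration and are not Crucible or pantry APIs. It walks the VCR the way described above (one upstairs per sub-volume, one per read only parent level) and treats the repair as complete only if every one of those upstairs has activated.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class SubVolume:
    # Hypothetical field names; values are copied from the omdb output above.
    region_set_id: str
    read_only: bool
    targets: List[str]


@dataclass
class Volume:
    volume_id: str
    sub_volumes: List[SubVolume]
    read_only_parent: Optional["Volume"] = None


def upstairs_layers(vol: Volume, depth: int = 0) -> Iterator[Tuple[int, SubVolume]]:
    """Yield (depth, sub_volume) for each upstairs that would be started."""
    for sv in vol.sub_volumes:
        yield depth, sv
    if vol.read_only_parent is not None:
        yield from upstairs_layers(vol.read_only_parent, depth + 1)


def repair_complete(activated: Dict[str, bool]) -> bool:
    """The repair only counts as done if every upstairs has activated."""
    return bool(activated) and all(activated.values())


DISK_20 = Volume(
    "bef4889d-e590-458e-81ec-b997e18f3760",
    [SubVolume("bef4889d-e590-458e-81ec-b997e18f3760", False, [
        "[fd00:1122:3344:102::e]:19007",
        "[fd00:1122:3344:102::8]:19006",
        "[fd00:1122:3344:103::7]:19003"])],
    Volume(
        "fef63546-9a7b-426b-ab2c-e72f6a3283ca",
        [SubVolume("fd7b68c1-268f-4510-8d28-def8c6fddc95", True, [
            "[fd00:1122:3344:101::b]:19005",
            "[fd00:1122:3344:102::d]:19005",
            "[fd00:1122:3344:121::24]:19008"])],
        Volume(
            "67fe02af-af39-413e-9c60-d6d3af08d66d",
            [SubVolume("39bbc2e1-f5eb-440d-9815-962fafe1fabf", True, [
                "[fd00:1122:3344:121::28]:19005",
                "[fd00:1122:3344:101::a]:19012",
                "[fd00:1122:3344:102::d]:19003"])],
        ),
    ),
)

layers = list(upstairs_layers(DISK_20))

# Hypothetical activation results: only the RW layer made it through.
activated = {
    layers[0][1].region_set_id: True,
    layers[1][1].region_set_id: False,
    layers[2][1].region_set_id: False,
}

for depth, sv in layers:
    print(depth, sv.region_set_id, "activated" if activated[sv.region_set_id] else "waiting")
print("repair complete:", repair_complete(activated))
```

The point of the model is just that a single layer that cannot activate is enough to keep the whole replacement from being reported as finished.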

And here is the problem. While the upstairs instance that needed to do a repair will do so and activate, other parts of this VCR also need to be repaired (in the example above, any layer that has a fd00:1122:3344:121: address is on the expunged sled), and those upstairs instances will not be able to activate because they cannot contact the expunged sled.
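As a quick way to see which layers are affected, the targets from the omdb output above can be checked against the expunged sled's prefix. This is a sketch using only Python's standard `ipaddress` module. It assumes the sled's prefix is the /64 `fd00:1122:3344:121::/64`, and the `LAYERS` dict, `parse_target`, and `blocked_targets` are names invented for this example.

```python
import ipaddress

# Assumption: the expunged sled owns this whole /64.
EXPUNGED_SLED = ipaddress.ip_network("fd00:1122:3344:121::/64")

# Downstairs targets per layer, copied from the VCR for disk-20 above.
LAYERS = {
    "bef4889d-e590-458e-81ec-b997e18f3760": [
        "[fd00:1122:3344:102::e]:19007",
        "[fd00:1122:3344:102::8]:19006",
        "[fd00:1122:3344:103::7]:19003",
    ],
    "fd7b68c1-268f-4510-8d28-def8c6fddc95": [
        "[fd00:1122:3344:101::b]:19005",
        "[fd00:1122:3344:102::d]:19005",
        "[fd00:1122:3344:121::24]:19008",
    ],
    "39bbc2e1-f5eb-440d-9815-962fafe1fabf": [
        "[fd00:1122:3344:121::28]:19005",
        "[fd00:1122:3344:101::a]:19012",
        "[fd00:1122:3344:102::d]:19003",
    ],
}


def parse_target(target):
    """Split '[addr]:port' into (IPv6Address, port)."""
    host, _, port = target.rpartition("]:")
    return ipaddress.ip_address(host.lstrip("[")), int(port)


def blocked_targets(targets):
    """Return the targets that live on the expunged sled."""
    return [t for t in targets if parse_target(t)[0] in EXPUNGED_SLED]


# Keep only the layers with at least one target on the expunged sled.
blocked = {}
for layer_id, targets in LAYERS.items():
    hits = blocked_targets(targets)
    if hits:
        blocked[layer_id] = hits

for layer_id, hits in blocked.items():
    print(layer_id, "still references", hits)
```

For this disk the check flags both read only layers and leaves the RW sub-volume alone, which matches the description above: the RW layer can activate, but the layers beneath it still point at the expunged sled.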

This also explains the dtrace output that found an upstairs instance with two WQ and one NEW:

oxz_crucible_pantry_47fa71bb 23599 2e3546f5 NEW  WQ  WQ

This would be an upstairs instance that was still trying to contact a downstairs on an expunged sled.
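For anyone scanning a lot of that dtrace output, a small filter like the following could pick out the suspicious instances. It is a sketch with assumptions: it treats the first three columns as zone, PID, and an upstairs ID prefix, treats the remaining columns as per-downstairs states, and takes NEW to mean a downstairs the upstairs has not managed to contact. The function names are invented here.

```python
def parse_state_line(line):
    """Split one line of the dtrace output into labeled fields.

    The field meanings are assumptions based on the line shown above.
    """
    zone, pid, upstairs_id, *states = line.split()
    return {
        "zone": zone,
        "pid": int(pid),
        "upstairs_id": upstairs_id,
        "states": states,
    }


def uncontacted_downstairs(line):
    """Return the indexes of downstairs whose state is NEW."""
    states = parse_state_line(line)["states"]
    return [i for i, state in enumerate(states) if state == "NEW"]


line = "oxz_crucible_pantry_47fa71bb 23599 2e3546f5 NEW  WQ  WQ"
info = parse_state_line(line)
stuck = uncontacted_downstairs(line)
print(info["zone"], info["upstairs_id"], "NEW at downstairs index", stuck)
```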
