
Volume failed to attach: No online replicas are available #1728

Closed
veenadong opened this issue Aug 27, 2024 · 9 comments

@veenadong

Mayastor 2.5.1:

Pod is failing to run with:

Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           3m59s                default-scheduler        Successfully assigned cassandra/cassandra-dc1-rack1-sts-0 to onprem-node45.hstlabs.glcp.hpecorp.net
  Warning  FailedAttachVolume  70s (x2 over 3m13s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f" : rpc error: code = Internal desc = Operation failed: PreconditionFailed("error in response: status code '412 Precondition Failed', content: 'RestJsonError { details: \"\", message: \"SvcError :: NoOnlineReplicas: No online replicas are available for Volume '6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f'\", kind: FailedPrecondition }'")

etcd shows:

kubectl exec mayastor-etcd-0 -n mayastor -c etcd -- etcdctl get "/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/volume/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f/nexus/2796f518-e69f-4986-ba7c-59e26073b50b/info"
/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/volume/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f/nexus/2796f518-e69f-4986-ba7c-59e26073b50b/info
{"children":[],"clean_shutdown":true}

/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/VolumeSpec/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f
{"uuid":"6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f","size":5368709120,"labels":null,"num_replicas":2,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":null,"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":null,"operation":null,"thin":true,"target":{"node":"onprem-node45.hstlabs.glcp.hpecorp.net","nexus":"2796f518-e69f-4986-ba7c-59e26073b50b","protocol":"nvmf","active":false,"config":{"controllerIdRange":{"start":7,"end":8},"reservationKey":13437714216982394123,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[{"node_name":"onprem-node45.hstlabs.glcp.hpecorp.net","node_nqn":"nqn.2019-05.io.openebs:node-name:onprem-node45.hstlabs.glcp.hpecorp.net"}]}},"publish_context":{"ioTimeout":"30"},"affinity_group":null}
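For context, the `/nexus/.../info` value is the persisted NexusInfo record: an empty `children` array means no replica was recorded as healthy, so on republish the control plane has no replica it can trust, which matches the `NoOnlineReplicas` error. A minimal sketch of reading such a record (plain Python over the JSON shown above; `healthy_children` is a hypothetical helper, not a Mayastor API):

```python
import json

def healthy_children(nexus_info_json: str) -> list:
    """Return UUIDs of children recorded as healthy in a NexusInfo value."""
    info = json.loads(nexus_info_json)
    return [c["uuid"] for c in info.get("children", []) if c.get("healthy")]

# NexusInfo value copied from the etcd output above.
record = '{"children":[],"clean_shutdown":true}'
if not healthy_children(record):
    # No healthy child was persisted: nothing for the control plane to
    # rebuild the nexus from, hence "No online replicas are available".
    print("no online replicas recorded for this nexus")
```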

Attaching the support bundle
mayastor-no-online.tar.gz

@dsharma-dc
Contributor

dsharma-dc commented Aug 28, 2024

Errors like these in two of the io-engine node logs (node 46 and node 47) explain the missing info in etcd. One thing that could possibly have led to this: the persist during nexus create didn't succeed, and the later persist during the nexus shutdown did.

[2024-08-27T03:56:40.683061372+00:00  WARN io_engine::persistent_store:persistent_store.rs:332] Attempting to reconnect to persistent store....
[2024-08-27T03:56:40.683125208+00:00  INFO io_engine::persistent_store:persistent_store.rs:162] Connected to etcd on endpoint mayastor-etcd:2379
[2024-08-27T03:56:40.683148950+00:00 ERROR io_engine::bdev::nexus::nexus_persistence:nexus_persistence.rs:237] Nexus '4760d15d-9a02-4e66-8c0c-bf0c574bfc72' [open]: failed to persist nexus information, will retry silently (99 left): Store operation timed out....

There are a lot of other connectivity errors on these io-engines, e.g. `NATS connection has been lost`, indicating some network issues.
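The "(99 left)" countdown in the log line above suggests a bounded retry budget for persisting nexus information. As a purely illustrative sketch of that pattern (Python, not the actual io-engine Rust code; `persist_with_retries` and the budget of 100 are assumptions):

```python
import time

def persist_with_retries(persist, retries=100, delay=0.1):
    """Call persist() until it succeeds or the retry budget runs out,
    mirroring the 'will retry silently (99 left)' countdown in the log."""
    last_err = None
    for _ in range(retries):
        try:
            return persist()
        except TimeoutError as err:  # e.g. "Store operation timed out"
            last_err = err
            time.sleep(delay)  # retry silently after a short pause
    raise last_err  # budget exhausted: give up with the last error
```

If the store stays unreachable past the budget, the record is simply never written, which would leave etcd in the inconsistent state seen here.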

@veenadong
Author

One of the nodes in this 3-node cluster was rebooted.

@veenadong
Author

We ran into another issue where the NexusSpec in etcd is missing.
The missing entry is related to:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info

The related key:
/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/NexusSpec/8f5d061c-82f1-46bb-8bd1-c4164573e1da is not present, preventing CSI from attaching the volume.

What is the best way to detect this error and recover?

mayastor-2024-08-29-missing-NexusSpec.tar.gz
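One possible way to detect this condition from an etcd dump (a hedged sketch; the key layout and JSON shapes are assumed from the etcdctl outputs in this thread, and `find_missing_nexus_specs` is a hypothetical helper, not Mayastor tooling) is to cross-check each published VolumeSpec's `target.nexus` against the set of NexusSpec keys:

```python
import json

def find_missing_nexus_specs(etcd_pairs):
    """etcd_pairs: (key, value) pairs as dumped by `etcdctl get "" --prefix`.
    Returns UUIDs of volumes whose published target nexus has no NexusSpec key."""
    # Collect the nexus UUIDs that do have a NexusSpec entry.
    nexus_spec_ids = {key.rsplit("/", 1)[-1]
                      for key, _ in etcd_pairs if "/NexusSpec/" in key}
    missing = []
    for key, value in etcd_pairs:
        if "/VolumeSpec/" not in key:
            continue
        spec = json.loads(value)
        target = spec.get("target")
        # A published volume whose nexus has no NexusSpec is inconsistent.
        if target and target["nexus"] not in nexus_spec_ids:
            missing.append(spec["uuid"])
    return missing
```

Any UUID this returns identifies a volume in the state described here: a VolumeSpec still pointing at a nexus whose NexusSpec entry is gone.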

@tiagolobocastro
Contributor

hmm it does seem present:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info:
{
  "children": [
    {
      "healthy": true,
      "uuid": "cda11680-b417-44d2-a2c6-88a111760c90"
    },
    {
      "healthy": true,
      "uuid": "a11e5b53-19df-4597-813e-19b576317211"
    }
  ],
  "clean_shutdown": true
}

There are several connection issues which are already fixed, so I suggest you upgrade to 2.7.0 (though we are also seeing a bug which is fixed and will be released in v2.7.1 soon). Either way, 2.7.0 will already be better.

@veenadong
Author

veenadong commented Aug 30, 2024

@tiagolobocastro
The key that's missing is:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/NexusSpec/8f5d061c-82f1-46bb-8bd1-c4164573e1da

The following is the command run while in the error state:

core@sc-os-160-node2:~/mayastor-2024-08-29--17-47-43-UTC$ kubectl exec -n mayastor mayastor-etcd-0 -- etcdctl get "" --prefix=true | grep 8f5d061c-82f1-46bb-8bd1-c4164573e1da
Defaulted container "etcd" out of: etcd, volume-permissions (init)
{"uuid":"4e01a5bc-d532-4186-a4ec-ab0b689a8e44","size":5368709120,"labels":null,"num_replicas":2,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":null,"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":null,"operation":null,"thin":true,"target":{"node":"sc-os-160-node2.glcpdev.cloud.hpe.com","nexus":"8f5d061c-82f1-46bb-8bd1-c4164573e1da","protocol":"nvmf","active":true,"config":{"controllerIdRange":{"start":11,"end":12},"reservationKey":10075049441338057178,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[{"node_name":"sc-os-160-node2.glcpdev.cloud.hpe.com","node_nqn":"nqn.2019-05.io.openebs:node-name:sc-os-160-node2.glcpdev.cloud.hpe.com"}]}},"publish_context":{"ioTimeout":"30"},"affinity_group":null}
/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info

Agent-core has error about missing Nexus:

2024-08-29T02:58:04.291465Z ERROR core::volume::service: error: Nexus '8f5d061c-82f1-46bb-8bd1-c4164573e1da' not found
    at control-plane/agents/src/bin/core/volume/service.rs:360

@tiagolobocastro
Contributor

You can try to scale down the app that is using the volume.
When scaled down, the volume should be "reset" to no longer expect a nexus to be present.
Then you can scale up again.

@veenadong
Author

You can try to scale down the app that is using the volume. When scaled down, the volume should be "reset" to no longer expect a nexus to be present. Then you can scale up again.

Tried deleting the pod. If the pod came back on the same node, the nexus didn't "reset". The "reset" only happens if the pod is scheduled to a different node.

Regarding the upgrade to 2.7.0: the mayastor plugin not recovering after one node goes down is really a blocker for us to upgrade (I filed an issue for it).
We observed a similar issue in 2.5.1, where the mayastor plugin recovered after 10 minutes or so. Are there timeout settings we can change to reduce that time (for both 2.5.1 and 2.7.0)?

@tiagolobocastro
Contributor

Tried deleting the pod. If the pod came back on the same node, the nexus didn't "reset". The "reset" only happens if the pod is scheduled to a different node.

So the nexus did not get deleted?

The mayastor plugin not recovering after one node goes down is really a blocker for us to upgrade (I filed an issue for it). We observed a similar issue in 2.5.1, where the mayastor plugin recovered after 10 minutes or so. Are there timeout settings we can change to reduce that time (for both 2.5.1 and 2.7.0)?

Is it this issue? #1715
You mean that for 10 minutes the mayastor plugin is not able to retrieve any info?
Are you still using replicas>1 for api-rest? Perhaps the plugin does not work very well in that scenario; let me test this.

@tiagolobocastro
Contributor

When you run with replicas>1, the mayastor-plugin requests can hit any of the api-rest pods.
When a node hosting an api-rest pod goes down, those requests may time out while the node is down.
In my case, after ~1 minute the api-rest pod from the downed node goes into Terminating state and another pod is started.

At that point all api-rest requests succeed again.
I wonder where your 10 minutes comes from; is this something you can reproduce? I suggest trying it on a fresh test cluster.
