
Volume failed to attach: No online replicas are available #1728

Closed
veenadong opened this issue Aug 27, 2024 · 9 comments

@veenadong

Mayastor 2.5.1:

Pod is failing to run with:

Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           3m59s                default-scheduler        Successfully assigned cassandra/cassandra-dc1-rack1-sts-0 to onprem-node45.hstlabs.glcp.hpecorp.net
  Warning  FailedAttachVolume  70s (x2 over 3m13s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f" : rpc error: code = Internal desc = Operation failed: PreconditionFailed("error in response: status code '412 Precondition Failed', content: 'RestJsonError { details: \"\", message: \"SvcError :: NoOnlineReplicas: No online replicas are available for Volume '6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f'\", kind: FailedPrecondition }'")

etcd shows:

kubectl exec mayastor-etcd-0 -n mayastor -c etcd -- etcdctl get "/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/volume/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f/nexus/2796f518-e69f-4986-ba7c-59e26073b50b/info"
/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/volume/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f/nexus/2796f518-e69f-4986-ba7c-59e26073b50b/info
{"children":[],"clean_shutdown":true}

/openebs.io/mayastor/apis/v0/clusters/7c79961b-b100-4f14-91ec-9c1e697362a4/namespaces/mayastor/VolumeSpec/6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f
{"uuid":"6e9bf08b-0ccf-4a37-a2fd-551a7a7f531f","size":5368709120,"labels":null,"num_replicas":2,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":null,"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":null,"operation":null,"thin":true,"target":{"node":"onprem-node45.hstlabs.glcp.hpecorp.net","nexus":"2796f518-e69f-4986-ba7c-59e26073b50b","protocol":"nvmf","active":false,"config":{"controllerIdRange":{"start":7,"end":8},"reservationKey":13437714216982394123,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[{"node_name":"onprem-node45.hstlabs.glcp.hpecorp.net","node_nqn":"nqn.2019-05.io.openebs:node-name:onprem-node45.hstlabs.glcp.hpecorp.net"}]}},"publish_context":{"ioTimeout":"30"},"affinity_group":null}
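For context, the `/nexus/.../info` value is the persisted NexusInfo record: an empty `children` array means no replica was recorded as healthy, so on republish the control plane has no replica it can trust, which matches the `NoOnlineReplicas` error. A minimal sketch of reading such a record (plain Python over the JSON shown above; `healthy_children` is a hypothetical helper, not a Mayastor API):

```python
import json

def healthy_children(nexus_info_json: str) -> list:
    """Return UUIDs of children recorded as healthy in a NexusInfo value."""
    info = json.loads(nexus_info_json)
    return [c["uuid"] for c in info.get("children", []) if c.get("healthy")]

# NexusInfo value copied from the etcd output above.
record = '{"children":[],"clean_shutdown":true}'
if not healthy_children(record):
    # No healthy child was persisted: nothing for the control plane to
    # rebuild the nexus from, hence "No online replicas are available".
    print("no online replicas recorded for this nexus")
```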

Attaching the support bundle
mayastor-no-online.tar.gz

@dsharma-dc
Contributor

dsharma-dc commented Aug 28, 2024

Errors like these in two of the io-engine node logs (node 46 and node 47) explain the missing info in etcd. One thing that could possibly have led to this: the persist during nexus create didn't succeed, and the later persist during the nexus shutdown did.

[2024-08-27T03:56:40.683061372+00:00  WARN io_engine::persistent_store:persistent_store.rs:332] Attempting to reconnect to persistent store....
[2024-08-27T03:56:40.683125208+00:00  INFO io_engine::persistent_store:persistent_store.rs:162] Connected to etcd on endpoint mayastor-etcd:2379
[2024-08-27T03:56:40.683148950+00:00 ERROR io_engine::bdev::nexus::nexus_persistence:nexus_persistence.rs:237] Nexus '4760d15d-9a02-4e66-8c0c-bf0c574bfc72' [open]: failed to persist nexus information, will retry silently (99 left): Store operation timed out....

There are a lot of other connectivity errors on these io-engines, e.g. `NATS connection has been lost`, indicating some network issues.
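The "(99 left)" countdown in the log line above suggests a bounded retry budget for persisting nexus information. As a purely illustrative sketch of that pattern (Python, not the actual io-engine Rust code; `persist_with_retries` and the budget of 100 are assumptions):

```python
import time

def persist_with_retries(persist, retries=100, delay=0.1):
    """Call persist() until it succeeds or the retry budget runs out,
    mirroring the 'will retry silently (99 left)' countdown in the log."""
    last_err = None
    for _ in range(retries):
        try:
            return persist()
        except TimeoutError as err:  # e.g. "Store operation timed out"
            last_err = err
            time.sleep(delay)  # retry silently after a short pause
    raise last_err  # budget exhausted: give up with the last error
```

If the store stays unreachable past the budget, the record is simply never written, which would leave etcd in the inconsistent state seen here.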

@veenadong
Author

One of the nodes in this 3-node cluster was rebooted.

@veenadong
Author

We ran into another issue where the NexusSpec in etcd is missing.
The missing entry is related to:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info

The related key:
/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/NexusSpec/8f5d061c-82f1-46bb-8bd1-c4164573e1da is not present, preventing CSI from attaching the volume.

What is the best way to detect this error and recover?

mayastor-2024-08-29-missing-NexusSpec.tar.gz
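One possible way to detect this condition from an etcd dump (a hedged sketch; the key layout and JSON shapes are assumed from the etcdctl outputs in this thread, and `find_missing_nexus_specs` is a hypothetical helper, not Mayastor tooling) is to cross-check each published VolumeSpec's `target.nexus` against the set of NexusSpec keys:

```python
import json

def find_missing_nexus_specs(etcd_pairs):
    """etcd_pairs: (key, value) pairs as dumped by `etcdctl get "" --prefix`.
    Returns UUIDs of volumes whose published target nexus has no NexusSpec key."""
    # Collect the nexus UUIDs that do have a NexusSpec entry.
    nexus_spec_ids = {key.rsplit("/", 1)[-1]
                      for key, _ in etcd_pairs if "/NexusSpec/" in key}
    missing = []
    for key, value in etcd_pairs:
        if "/VolumeSpec/" not in key:
            continue
        spec = json.loads(value)
        target = spec.get("target")
        # A published volume whose nexus has no NexusSpec is inconsistent.
        if target and target["nexus"] not in nexus_spec_ids:
            missing.append(spec["uuid"])
    return missing
```

Any UUID this returns identifies a volume in the state described here: a VolumeSpec still pointing at a nexus whose NexusSpec entry is gone.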

@tiagolobocastro
Contributor

hmm it does seem present:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info:
{
  "children": [
    {
      "healthy": true,
      "uuid": "cda11680-b417-44d2-a2c6-88a111760c90"
    },
    {
      "healthy": true,
      "uuid": "a11e5b53-19df-4597-813e-19b576317211"
    }
  ],
  "clean_shutdown": true
}

There are several connection issues which are already fixed, so I suggest you upgrade to 2.7.0 (though we are also seeing a bug which is fixed and will be released in v2.7.1 soon). Either way, 2.7.0 will already be better.

@veenadong
Author

veenadong commented Aug 30, 2024

@tiagolobocastro
The key that's missing is:

/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/NexusSpec/8f5d061c-82f1-46bb-8bd1-c4164573e1da

The following is the command run while in the error state:

core@sc-os-160-node2:~/mayastor-2024-08-29--17-47-43-UTC$ kubectl exec -n mayastor mayastor-etcd-0 -- etcdctl get "" --prefix=true | grep 8f5d061c-82f1-46bb-8bd1-c4164573e1da
Defaulted container "etcd" out of: etcd, volume-permissions (init)
{"uuid":"4e01a5bc-d532-4186-a4ec-ab0b689a8e44","size":5368709120,"labels":null,"num_replicas":2,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":null,"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":null,"operation":null,"thin":true,"target":{"node":"sc-os-160-node2.glcpdev.cloud.hpe.com","nexus":"8f5d061c-82f1-46bb-8bd1-c4164573e1da","protocol":"nvmf","active":true,"config":{"controllerIdRange":{"start":11,"end":12},"reservationKey":10075049441338057178,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[{"node_name":"sc-os-160-node2.glcpdev.cloud.hpe.com","node_nqn":"nqn.2019-05.io.openebs:node-name:sc-os-160-node2.glcpdev.cloud.hpe.com"}]}},"publish_context":{"ioTimeout":"30"},"affinity_group":null}
/openebs.io/mayastor/apis/v0/clusters/1013e263-e0ba-48b2-ae78-52d51b5da9c8/namespaces/mayastor/volume/4e01a5bc-d532-4186-a4ec-ab0b689a8e44/nexus/8f5d061c-82f1-46bb-8bd1-c4164573e1da/info

Agent-core has error about missing Nexus:

2024-08-29T02:58:04.291465Z ERROR core::volume::service: error: Nexus '8f5d061c-82f1-46bb-8bd1-c4164573e1da' not found
    at control-plane/agents/src/bin/core/volume/service.rs:360

@tiagolobocastro
Contributor

You can try to scale down the app that is using the volume.
When scaled down, the volume should be "reset" to no longer expect a nexus to be present.
Then you can scale up again.

@veenadong
Author

You can try to scale down the app that is using the volume. When scaled down, the volume should be "reset" to no longer expect a nexus to be present. Then you can scale up again.

Tried deleting the pod. If the pod came back on the same node, the nexus didn't "reset". The "reset" only happens if the pod is scheduled to a different node.

Regarding the upgrade to 2.7.0: the mayastor plugin not recovering after one node goes down is really a blocker for us to upgrade (I filed an issue for it).
We observed a similar issue in 2.5.1, where the mayastor plugin recovered after 10 minutes or so. Are there timeout settings we can change to reduce that time (for both 2.5.1 and 2.7.0)?

@tiagolobocastro
Contributor

Tried deleting the pod. If the pod came back on the same node, the nexus didn't "reset". The "reset" only happens if the pod is scheduled to a different node.

So the nexus did not get deleted?

The mayastor plugin not recovering after one node goes down is really a blocker for us to upgrade (I filed an issue for it). We observed a similar issue in 2.5.1, where the mayastor plugin recovered after 10 minutes or so. Are there timeout settings we can change to reduce that time (for both 2.5.1 and 2.7.0)?

Is it this issue? #1715
You mean that for 10 minutes the mayastor plugin is not able to retrieve any info?
Are you still using replicas>1 for api-rest? Perhaps the plugin does not work very well in that scenario; let me test this.

@tiagolobocastro
Contributor

When you run with replicas>1, the mayastor-plugin requests can hit any of the api-rest pods.
When a node hosting an api-rest pod goes down, those requests may time out while the node is down.
In my case, after ~1 minute the api-rest pod from the downed node goes into Terminating state and another pod is started.

At that point all api-rest requests succeed again.
I wonder where your 10 minutes comes from; is this something you can reproduce? I suggest trying it on a fresh test cluster.
