Rolling pool update does not resume after reboot. #7613

Open
tuxpowered opened this issue Apr 29, 2024 · 4 comments

@tuxpowered

tuxpowered commented Apr 29, 2024

Are you using XOA or XO from the sources?

XO from the sources

Which release channel?

None

Provide your commit number

0794a

Describe the bug

When performing a "Rolling Update" on an HA cluster, XO migrates all VMs off the primary node to the other nodes (good). The primary node then reboots; however, when it comes back online, the other nodes in the HA cluster do not resume downloading and applying patches.

Error message

From Settings > Logs:

server.enable
{
  "id": "0bce7468-93e5-4376-93c1-c75082f8f436"
}
{
  "name": "ConnectTimeoutError",
  "code": "UND_ERR_CONNECT_TIMEOUT",
  "call": {
    "method": "session.login_with_password",
    "params": "* obfuscated *"
  },
  "message": "Connect Timeout Error",
  "stack": "ConnectTimeoutError: Connect Timeout Error
    at onConnectTimeout (/opt/xen-orchestra/node_modules/undici/lib/core/connect.js:190:24)
    at /opt/xen-orchestra/node_modules/undici/lib/core/connect.js:133:46
    at Immediate._onImmediate (/opt/xen-orchestra/node_modules/undici/lib/core/connect.js:174:9)
    at processImmediate (node:internal/timers:476:21)
    at process.callbackTrampoline (node:internal/async_hooks:128:17)"
}
pool.rollingUpdate
{
  "pool": "62d8471c-e515-0d7a-d77f-5ac38a945507"
}
{
  "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart",
  "name": "Error",
  "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart
    at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:127:9)
    at Xapi.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:501:5)
    at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:689:5)
    at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:231:3)
    at Api.#callApiMethod (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/api.mjs:366:20)"
}
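
For triage, it can help to pull the affected host UUID out of a pasted log like the one above so it can be matched against the pool's host list. The following is a minimal sketch, not part of XO: the function name `timed_out_hosts` is hypothetical, and it assumes the error text keeps the exact wording "Host <uuid> took too long to restart" seen in this report. It scans raw text rather than parsing JSON, because the pasted `stack` field contains raw line breaks.

```python
import re
from typing import List

# Matches the wording seen in this report's pool.rollingUpdate error.
# Assumption: the message format is "Host <uuid> took too long to restart".
PATTERN = re.compile(
    r"Host ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) took too long to restart"
)


def timed_out_hosts(log_text: str) -> List[str]:
    """Return the distinct host UUIDs reported as slow to restart, in order of appearance."""
    seen: List[str] = []
    for uuid in PATTERN.findall(log_text):
        # The same UUID shows up in both "message" and "stack", so de-duplicate.
        if uuid not in seen:
            seen.append(uuid)
    return seen
```

Pasting the whole Settings > Logs entry into `timed_out_hosts` should return a single UUID for this report, since the message and the stack trace name the same host.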

To reproduce

  1. Go to 'Home > Pools > Select HA Pool'
  2. Click on 'Patches > Rolling pool Update'
  3. See error (none displayed; review the logs)

Expected behavior

After the primary node reboots, migration of the VMs back should resume, and the process should move on to the next node in the pool and repeat.

Screenshots

No response

Node

18.20.0

Hypervisor

8.2.1

Additional context

It appears that the HA master correctly has its VMs migrated and patches applied first.
Systems all have 10GB dedicated storage and 1GB interface for VM access and management.

@Danp2
Collaborator

Danp2 commented Apr 29, 2024

commit number 0794a

You are about a month behind on updates. Also, have you seen the latest revisions to the documentation where it explains how to increase the timeout period? https://xen-orchestra.com/docs/manage_infrastructure.html#rolling-pool-updates-rpu
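
For context on why a slow reboot can abort the whole update: the "took too long to restart" error in the log is consistent with a poll-until-deadline wait on the rebooted host. The sketch below illustrates that general pattern only. It is not XO's code, and `wait_for_host`, `is_host_up`, and `timeout_s` are hypothetical names; the real timeout setting is the one described in the linked documentation.

```python
import time
from typing import Callable


class HostRestartTimeout(Exception):
    """Raised when a host does not come back before the deadline."""


def wait_for_host(
    host_id: str,
    is_host_up: Callable[[str], bool],  # hypothetical health probe
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the host responds or the deadline passes.

    Returns the number of seconds waited. Raises HostRestartTimeout if the
    host is still down once timeout_s has elapsed.
    """
    start = clock()
    while True:
        if is_host_up(host_id):
            return clock() - start
        if clock() - start >= timeout_s:
            # Once this is raised, the caller stops; remaining hosts are not processed.
            raise HostRestartTimeout(f"Host {host_id} took too long to restart")
        sleep(poll_interval_s)
```

With a wait like this, a host that needs longer to boot than `timeout_s` ends the rolling update even though the host itself eventually comes back, which matches the symptom reported here. Raising the timeout gives slow hosts more room at the cost of a longer wait when a host is genuinely down.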

@tuxpowered
Author

Oh wow, that far behind already? Seems like it was just a few weeks ago I updated.
Did not see the timeout update. I will update and review.
It is odd, because I have 2 clusters: one updates fine with no problem, while the other has an issue (I just started testing the other cluster).

@Danp2
Collaborator

Danp2 commented Apr 30, 2024

[Rolling Pool Update/Reboot] Use XO tasks for better reportability (PR #7578)

This was merged earlier today, which will make monitoring the RPU much easier.

@b-Nollet
Contributor

We've recently made some changes to the RPU, including a fix for a bug introduced by the release earlier this month.
Can you update to the latest version and test if the problem is still present? (and provide us with the XO task logs)
