NetworkNotFound error preventing server creation in VexxHost OpenStack #1061

Open
marmijo opened this issue Nov 22, 2024 · 1 comment

marmijo commented Nov 22, 2024

Description

A NetworkNotFound error was seen in the FCOS OpenStack instance on VexxHost today. The error caused all server creation in OpenStack to fail, both in the kola-openstack job and locally through the CLI.

harness.go:1782: Cluster failed starting machines: waiting for instance to run: 
Server reported ERROR status: 
{500 2024-11-22 20:56:47 +0000 UTC  Build of instance 8837696b-6172-4721-aa4a-729e576573d5 aborted: Failed to allocate the network(s), not rescheduling.

The issue seems to be that the private network we attach to the servers is not available, but the network looks fine in the CLI and the cloud console.

When creating a server using openstack server create --debug --network=private <other server creation arguments>, the following debug message can be seen:

RESP BODY: {"NeutronError": {"type": "NetworkNotFound", "message": "Network private could not be found.", "detail": ""}}
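To pick this kind of error out of a long --debug capture, a small parser can help. The following is a minimal sketch, assuming each response body is logged on a single line with the "RESP BODY: " prefix shown above; the neutron_error helper is a hypothetical name, not part of any OpenStack tooling.

```python
import json

PREFIX = "RESP BODY: "  # prefix as it appears in the debug line above


def neutron_error(line):
    """Return (type, message) if the line carries a NeutronError, else None.

    Hypothetical helper: assumes the whole JSON body sits on one line.
    """
    if not line.startswith(PREFIX):
        return None
    try:
        body = json.loads(line[len(PREFIX):])
    except json.JSONDecodeError:
        return None
    err = body.get("NeutronError") if isinstance(body, dict) else None
    if not isinstance(err, dict):
        return None
    return err.get("type"), err.get("message")


line = ('RESP BODY: {"NeutronError": {"type": "NetworkNotFound", '
        '"message": "Network private could not be found.", "detail": ""}}')
print(neutron_error(line))
```

Running it over the saved debug output line by line would show whether NetworkNotFound is the only Neutron error being returned or whether other error types appear alongside it.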

The instance fails to launch and the error shows that the private network cannot be found. However:

  • The private network exists and appears healthy (openstack network show private confirms this, as does the cloud console).
  • The network has a valid subnet and sufficient IP addresses.
  • Instances launched yesterday (2024-11-21) using this network were functional.

Additional Information

OpenStack region doesn't seem to make a difference

The failure was first seen in the ca-ymq-1 region, but I saw the same error when I tried creating a server in ams1.

Timing of failure

We saw a successful kola-openstack run on 2024-11-22 8:54 UTC, but then saw the following error in a kola-openstack run at 2024-11-22 8:13 UTC:

[2024-11-22T08:11:42.185Z] + ore openstack --config-file=**** --region=ca-ymq-1 create-image --file=/home/jenkins/agent/workspace/kola-openstack/builds/41.20241119.20.1/aarch64/fedora-coreos-41.20241119.20.1-openstack.aarch64.qcow2 --name=kola-fedora-coreos-testing-devel-aarch64 --arch=aarch64
[2024-11-22T08:12:50.015Z] Couldn't create image: creating image: Expected HTTP response code [201] when accessing [POST https://image.public.mtl1.vexxhost.net/v2/images], but got 504 instead
[2024-11-22T08:12:50.015Z] <html>
[2024-11-22T08:12:50.015Z] <head><title>504 Gateway Time-out</title></head>
[2024-11-22T08:12:50.015Z] <body>
[2024-11-22T08:12:50.015Z] <center><h1>504 Gateway Time-out</h1></center>
[2024-11-22T08:12:50.015Z] <hr><center>nginx</center>
[2024-11-22T08:12:50.015Z] </body>
[2024-11-22T08:12:50.015Z] </html>
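As a rough check on how long the image-creation step ran before nginx returned the 504, the two Jenkins timestamps in the log above can be subtracted. This is a sketch only: the interval covers everything ore did between being invoked and reporting the error, so it is an upper bound on how long the POST itself waited, and elapsed_seconds is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"  # timestamp format used in the Jenkins log lines


def elapsed_seconds(start, end):
    """Seconds between two Jenkins-style UTC timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the log above: ore invocation and the 504 response.
elapsed = elapsed_seconds("2024-11-22T08:11:42.185Z", "2024-11-22T08:12:50.015Z")
print(f"{elapsed:.3f} s")
```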

From 2024-11-22 17:52 UTC onwards, every run failed with the NetworkNotFound error.

There may have been some stability issues this morning that affected our host, although the VexxHost status page shows green: https://status.vexxhost.com/

Nova Compute Logs

I searched for similar instances of this failure. Articles and forum posts point towards checking the Nova compute logs at /var/log/nova/nova-compute.log and running a command as root on the host to resolve the issue. However, we don't have access to the host resources.

marmijo added a commit to marmijo/fedora-coreos-pipeline that referenced this issue Nov 22, 2024
All openstack runs started failing today due to
coreos#1061.
Let's disable the mechanism to kick off openstack jobs for a few days
to see if the issue is resolved after the weekend.

marmijo commented Nov 22, 2024

PR to disable kola-openstack in the pipeline for the weekend: #1062

marmijo added a commit to marmijo/fedora-coreos-pipeline that referenced this issue Nov 22, 2024
All openstack runs started failing today due to
coreos#1061
Stop running tests on openstack for the next few days to see
if the issue is resolved after the weekend.
dustymabe pushed a commit that referenced this issue Nov 22, 2024
All openstack runs started failing today due to
#1061
Stop running tests on openstack for the next few days to see
if the issue is resolved after the weekend.