Ironic health checks do not check against useful requests #1528

Open
dankingtech opened this issue Jan 22, 2024 · 8 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • triage/accepted: Indicates an issue is ready to be actively worked on.

Comments

@dankingtech

What steps did you take and what happened:

By any method, cause the connection to the database to fail. Any request to do actual work will then fail, yet the deployment is still seen as live and ready, because the health checks only verify that the base URL responds, which it can do even when the internal connections are down. For instance, a request to http://127.0.0.1:6385/v1/nodes/ or a similar endpoint may return an error, but Kubernetes does not know about it.

What did you expect to happen:

Ideally, the liveness probe should check at least one other endpoint that relies on the database connection, so that a failure there shows that the Ironic instance is not healthy.

Anything else you would like to add:

Unfortunately, I have noticed that Ironic can, for a number of reasons, fail to connect to the database. In the past I have seen this caused by problems in the database itself as well as by problems in the running instance of the Ironic API. In most cases, simply restarting Ironic has resolved the issue. Regardless, if the backend is unavailable, Ironic is of little use. Therefore, I recommend changing the livenessProbe to check /v1/nodes/ rather than just /. The same may be true of the Inspector, which could check /v1/rules.
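
As a rough illustration of the proposed check, the sketch below probes a given path and treats anything other than a 2xx answer as unhealthy. It is written in Python purely for illustration. The names `status_is_healthy`, `probe` and `exit_code` are hypothetical and not part of Ironic or the Baremetal Operator, and the address is the one from the example above. It also assumes that the endpoint can be reached without credentials and that a broken database connection shows up as a non-2xx response or a failed request.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: the Ironic API listens on the address used in the report.
BASE_URL = "http://127.0.0.1:6385"


def status_is_healthy(status):
    """Treat only 2xx responses as healthy."""
    return 200 <= status < 300


def probe(path, base_url=BASE_URL, timeout=5.0):
    """Return True only if GET base_url + path answers with a 2xx status."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return status_is_healthy(response.status)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx answers;
        # refused connections and timeouts surface as URLError or OSError.
        return False


def exit_code(path, base_url=BASE_URL, timeout=5.0):
    """Map the probe result to a process exit status: 0 healthy, 1 unhealthy."""
    return 0 if probe(path, base_url, timeout) else 1
```

A probe script built on this could end with `raise SystemExit(exit_code("/v1/nodes/"))`, so that a non-zero exit status marks the container as unhealthy. Probing `/` instead would reproduce the behaviour described in this report.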

Environment:

  • Baremetal Operator version: Observed on v0.3.0, but the check appears in main currently.

/kind bug

@metal3-io-bot metal3-io-bot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Jan 22, 2024
@dankingtech
Author

Actually, it looks like /v1/conductors/ would be a better endpoint than /v1/nodes/, as it would usually require less work than the latter.

@dtantsur
Member

dtantsur commented Feb 7, 2024

To repeat what I mentioned on IRC: any meaningful endpoint will require authentication, so we need to make sure the healthcheck script can use it.
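
To make the authentication requirement concrete, here is a minimal sketch of how a check could attach credentials to its request. It assumes HTTP basic authentication and a secret mounted as a directory containing `username` and `password` files. Both the file layout and the function names are hypothetical; the actual way of obtaining credentials is still under discussion in this thread.

```python
import base64
import urllib.request
from pathlib import Path


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def read_credentials(secret_dir):
    """Read credentials from a mounted secret directory.

    Assumption: the directory holds 'username' and 'password' files.
    """
    root = Path(secret_dir)
    username = (root / "username").read_text(encoding="utf-8").strip()
    password = (root / "password").read_text(encoding="utf-8").strip()
    return username, password


def authenticated_request(url, secret_dir):
    """Prepare (but do not send) a GET request carrying the credentials."""
    username, password = read_credentials(secret_dir)
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

The returned object can be passed to `urllib.request.urlopen`. Reading the files on every call means a rotated secret is picked up without restarting the container.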

dtantsur added a commit to dtantsur/ironic-image that referenced this issue Apr 4, 2024
This change is the first step towards
metal3-io/baremetal-operator#1528.
Through these scripts, we can decouple the validation logic from
the pod definition and provide more sophisticated tests in the future.

Right now, the same curl command is used (modulo supporting all variants
of deploying Ironic).

Signed-off-by: Dmitry Tantsur <[email protected]>
dtantsur added a commit to dtantsur/ironic-image that referenced this issue Apr 5, 2024
This change is the first step towards
metal3-io/baremetal-operator#1528.
Through these scripts, we can decouple the validation logic from
the pod definition and provide more sophisticated tests in the future.

Right now, the same curl command is used (modulo supporting all variants
of deploying Ironic).

Signed-off-by: Dmitry Tantsur <[email protected]>
dtantsur added a commit to dtantsur/baremetal-operator that referenced this issue Apr 8, 2024
Using them allows us not to care about all possible ways Ironic can be
installed. In the future, we can use the mounted secrets to exercise
less trivial API resources such as conductors or drivers
(see metal3-io#1528).

Signed-off-by: Dmitry Tantsur <[email protected]>
metal3-io-bot pushed a commit to metal3-io-bot/ironic-image that referenced this issue Apr 9, 2024
This change is the first step towards
metal3-io/baremetal-operator#1528.
Through these scripts, we can decouple the validation logic from
the pod definition and provide more sophisticated tests in the future.

Right now, the same curl command is used (modulo supporting all variants
of deploying Ironic).

Signed-off-by: Dmitry Tantsur <[email protected]>
iurygregory pushed a commit to iurygregory/baremetal-operator that referenced this issue Apr 10, 2024
Using them allows us not to care about all possible ways Ironic can be
installed. In the future, we can use the mounted secrets to exercise
less trivial API resources such as conductors or drivers
(see metal3-io#1528).

Signed-off-by: Dmitry Tantsur <[email protected]>
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2024
@dtantsur
Member

/remove-lifecycle stale
/triage accepted

With metal3-io-bot/ironic-image@e44c4f7, we now have a path forward. We also need to finish the discussion around #1685 since it affects how we get the credentials.

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels May 15, 2024
@dtantsur
Member

/kind feature
/help

@metal3-io-bot
Contributor

@dtantsur:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

@metal3-io-bot metal3-io-bot added kind/feature Categorizes issue or PR as related to a new feature. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels May 15, 2024
@dtantsur dtantsur removed the kind/bug Categorizes issue or PR as related to a bug. label May 15, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 13, 2024
@tuminoid
Member

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2024
Projects
Status: Ironic-image on hold / blocked
Development

No branches or pull requests

4 participants