Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add request metrics for disaster recovery #13825

Merged
merged 33 commits into from
Sep 6, 2024

Conversation

hamistao
Copy link
Contributor

@hamistao hamistao commented Jul 26, 2024

This introduces API rates metrics for Disaster Recovery, the spec related to this should be published once this is merged.

Here is a raw output example of /1.0/metrics with the new metrics:
https://pastebin.canonical.com/p/wvhcnK7tq6/

@github-actions github-actions bot added Documentation Documentation needs updating API Changes to the REST API labels Jul 26, 2024
Copy link

github-actions bot commented Jul 26, 2024

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@hamistao
Copy link
Contributor Author

hamistao commented Jul 26, 2024

@tomponline Please take a look at this when you can, I would appreciate some comments on the overall structure of the solution so far.

@tomponline
Copy link
Member

This is uncomplete, to be done: 1.Handle metric value updates on asynchronous operations 2.Fine tune entity_type assignment 3.Fix GH tests; 4.Manual tests 5.Docs

You can convert this to a checklist in GH so we can see progress as you tick them off.

lxd/api.go Outdated Show resolved Hide resolved
lxd/metrics/types.go Outdated Show resolved Hide resolved
test/suites/metrics.sh Outdated Show resolved Hide resolved
shared/entity/type.go Outdated Show resolved Hide resolved
@hamistao hamistao force-pushed the dr_metrics branch 4 times, most recently from 9f0c24a to ee0e814 Compare July 30, 2024 11:40
lxd/api.go Outdated Show resolved Hide resolved
@hamistao hamistao force-pushed the dr_metrics branch 2 times, most recently from c0cfe23 to 8a77d1e Compare July 30, 2024 21:02
@hamistao
Copy link
Contributor Author

hamistao commented Jul 30, 2024

@mseralessandri @tomponline This is ready for a full review.
The categorization of the endpoints on entity types looks like this, those marked with * are endpoints I categorized on my own as they were not spefically mentioned during the discussions, if you think I should leave those separated or on other category let me know and I will make the changes:

TypeInstance for /{version}/containers, /{version}/virtual-machines and /{version}/instances endpoints.
TypeNetwork for /{version}/network-zones, /{version}/network-allocations, /{version}/network-acls and /{version}/networks endpoints.
TypeStoragePool for /{version}/storage-pools and /{version}/storage-volumes endpoints.
TypeIdentity for /{version}/auth and /{version}/certificates* endpoints.
TypeImage for /{version}/images endpoints.
TypeNode for /{version}/cluster endpoints.
TypeProject for /{version}/projects endpoints.
TypeProfile for /{version}/profiles endpoints.
TypeWarning for /{version}/warnings endpoints.
TypeOperation for /{version}/operations endpoints.
TypeServer for /{version}/events, /{version}/metrics, /, /1.0/, /1.0/events, /1.0/internal and /{version}/resources endpoints.*

@tomponline There is one caveat that should be mentioned. I am using the 400 status on operations to derive if the request result is a server error. But perhaps the 400 status is too broad and may also include some types of client errors (e.g. trying to add a block device to a container). I am not sure if I should handle that differently, maybe perform a more intricate analysis of the operation instead of just checking the status code.

@hamistao hamistao marked this pull request as ready for review July 30, 2024 21:09
@hamistao hamistao force-pushed the dr_metrics branch 2 times, most recently from f7af662 to 09b329b Compare July 30, 2024 21:31
This is useful to mark the request that spawned that operation as
completed when the operation is done.

Signed-off-by: hamistao <[email protected]>
Uses the callback function when the operation finishes to mark
the request that sapwned the operation as completed.

Signed-off-by: hamistao <[email protected]>
As a consequence of the  introduction of the parameter on Render.
Those fields have become obsolete and should be substituted and
removed.

Signed-off-by: hamistao <[email protected]>
This ensures that:
1. Internal metrics are not cached and are always updated. Before this change,
   the new values were computed but only the older cached values were included in the endpoint output.
2. Internal metrics are included when there are no instances on the default project.
   That happened because if no instances were present, the metric set for the default project
   would not be initialized and thus the internal metrics wouldn't have a set to be included in.

This is included on this PR so that the tests won't fail due to the
metrics' values not being updated quick enough.

Signed-off-by: hamistao <[email protected]>
@hamistao
Copy link
Contributor Author

hamistao commented Sep 6, 2024

@tomponline The requested changes were made.


The API rates metrics include `lxd_api_requests_completed_total` and `lxd_api_requests_ongoing`. These metrics can be consumed by an observability tool deployed externally (for example, the [Canonical Observability Stack](https://charmhub.io/topics/canonical-observability-stack) or another third-party tool) to help identify failures or overload on a LXD server. You can set thresholds on the observability tools for these metrics' values to trigger alarms and take programmatic actions.

These metrics consider all endpoints in the [LXD REST API](../api), with the exception of the `/` endpoint. Requests using an invalid URL are also counted. Requests against the metrics server are also counted. Both introduced metrics include a label `entity_type` based on the main entity type that the endpoint is operating on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requests against the metrics server are also counted.

What entity type is used for this?

Copy link
Contributor Author

@hamistao hamistao Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the type is determined by the endpoint and it just handles the /1.0 and the metrics endpoints, it would betypeServer. Same as the regular rest server

Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tomponline tomponline removed the request for review from mseralessandri September 6, 2024 07:16
@tomponline tomponline merged commit a5526d9 into canonical:main Sep 6, 2024
29 checks passed
masnax added a commit to canonical/microcluster that referenced this pull request Sep 9, 2024
Starting from canonical/lxd#13825 when using
`Render()` we also have to pass the request besides the response writer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Changes to the REST API Documentation Documentation needs updating
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants