feat: metrics for services and checks #519

Open
wants to merge 46 commits into
base: master


IronCore864
Contributor

@IronCore864 IronCore864 commented Nov 13, 2024

Add simple metrics for services and health checks in OpenTelemetry exposition format.

Metrics include:

  • pebble_service_active{service="foo"}
  • pebble_service_starts_total{service="foo"}
  • pebble_check_up{check="bar"}
  • pebble_perform_check_count{check="bar"}
  • pebble_recover_check_count{check="bar"}

More details:

  • Previously, we considered a self-implemented metrics library, but after spec review and discussion we found that it makes memory management hard (for example, removing metrics when services are removed). So we implemented these metrics on existing structs (serviceData and CheckInfo) instead.
  • To make updating metrics easier, CheckManager.checks is changed from map[string]CheckInfo to a map of pointers, map[string]*CheckInfo.
  • An interface is created so that the metrics can be exported in another format more easily if needed.
  • Note that this PR contains the basic identity type from #563 (feat: add a new basic type identity), which needs to be merged first.
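
To illustrate the approach described above, here is a minimal sketch of counters kept on a per-service struct and written through a small interface. The names Metric, Writer, textWriter, writeMetrics, and render, and the fields of serviceData, are hypothetical and chosen for illustration only; they are not taken from the PR's code. The output format is assumed to be the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// MetricType, Metric and Writer are hypothetical names, not Pebble's API.
type MetricType string

const (
	Counter MetricType = "counter"
	Gauge   MetricType = "gauge"
)

// Metric is one sample with a single label.
type Metric struct {
	Name       string
	Type       MetricType
	LabelName  string
	LabelValue string
	Value      int64
}

// Writer abstracts the output format so another format can be added later.
type Writer interface {
	Write(m Metric) error
}

// textWriter emits one TYPE line and one sample line per metric.
// A real writer serving many services would emit TYPE once per metric name.
type textWriter struct{ w io.Writer }

func (t textWriter) Write(m Metric) error {
	_, err := fmt.Fprintf(t.w, "# TYPE %s %s\n%s{%s=%q} %d\n",
		m.Name, m.Type, m.Name, m.LabelName, m.LabelValue, m.Value)
	return err
}

// serviceData keeps the counters on the per-service struct, so they go
// away together with the service. The fields are illustrative.
type serviceData struct {
	name       string
	active     bool
	startCount int64
}

func (s *serviceData) writeMetrics(w Writer) error {
	active := int64(0)
	if s.active {
		active = 1
	}
	err := w.Write(Metric{Name: "pebble_service_active", Type: Gauge,
		LabelName: "service", LabelValue: s.name, Value: active})
	if err != nil {
		return err
	}
	return w.Write(Metric{Name: "pebble_service_starts_total", Type: Counter,
		LabelName: "service", LabelValue: s.name, Value: s.startCount})
}

// render returns the metrics of a single service as text.
func render(s *serviceData) string {
	var sb strings.Builder
	if err := s.writeMetrics(textWriter{&sb}); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	fmt.Print(render(&serviceData{name: "foo", active: true, startCount: 3}))
}
```

Because the counters live on the struct, no separate registry has to be cleaned up when a service is removed, which is the memory-management concern raised above.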

@IronCore864 IronCore864 requested a review from benhoyt November 13, 2024 09:11
@IronCore864 IronCore864 marked this pull request as ready for review November 13, 2024 09:11
@IronCore864
Contributor Author

IronCore864 commented Nov 14, 2024

Testing the integration with Prometheus to confirm that the metrics are compatible:

[Screenshots: a counter metric and a gauge metric displayed in Prometheus]

@IronCore864
Contributor Author

IronCore864 commented Nov 14, 2024

Some investigation into the default metrics that come with the Prometheus Go client:

1 List of Metrics from Go runtime/metrics

ca. 80 items.

  • /cgo/go-to-c-calls:calls
  • /cpu/*
  • /gc/cycles/*
  • /gc/gogc:percent
  • /gc/gomemlimit:bytes
  • /gc/heap/*
  • /gc/limiter/last-enabled:gc-cycle
  • /gc/pauses:seconds
  • /gc/scan/*
  • /gc/stack/starting-size:bytes
  • /godebug/*
  • /memory/*
  • /sched/gomaxprocs:threads
  • /sched/goroutines:goroutines
  • /sched/latencies:seconds
  • /sched/pauses/*
  • /sync/mutex/wait/total:seconds

2 List of Default Metrics from prometheus/client_golang

ca. 40 items.

  • go_gc_duration_seconds{quantile="*"}
  • go_gc_duration_seconds_sum
  • go_gc_duration_seconds_count
  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_goroutines
  • go_info{version="go1.23.1"}
  • go_memstats_*
  • go_sched_gomaxprocs_threads
  • go_threads
  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

3 How Are the Default Metrics from prometheus/client_golang Fetched/Calculated

3.1 From runtime and runtime/debug

  • go_goroutines
  • go_threads
  • go_gc_duration_seconds
  • go_memstats_last_gc_time_seconds
  • go_info

3.2 From runtime.MemStats with Some Calculations

  • go_memstats_alloc_bytes
  • go_memstats_alloc_bytes_total
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_frees_total
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_objects
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_last_gc_time_seconds
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_other_sys_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
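
As a rough idea of what re-implementing these by hand would involve, the sketch below reads a few values from runtime and runtime.MemStats using only the standard library. The function name collectRuntime is hypothetical, and the mapping from MemStats fields to metric names is an assumption based on the names alone; the calculations that client_golang applies to some fields are not reproduced here.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
)

// collectRuntime gathers a handful of Go runtime values under
// Prometheus-style names. The field-to-name mapping is an assumption.
func collectRuntime() map[string]uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return map[string]uint64{
		"go_goroutines":            uint64(runtime.NumGoroutine()),
		"go_memstats_alloc_bytes":  ms.Alloc,
		"go_memstats_heap_objects": ms.HeapObjects,
		"go_memstats_sys_bytes":    ms.Sys,
	}
}

func main() {
	m := collectRuntime()
	// Sort the names so the output order is stable between runs.
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s %d\n", name, m[name])
	}
}
```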

3.3 Directly from runtime/metrics but Go-Version-Dependent

  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_sched_gomaxprocs_threads
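
The version dependence can be handled defensively: runtime/metrics reports an unknown metric name with kind KindBad rather than failing. A minimal sketch, where the helper name readUint64 is hypothetical and the three metrics are assumed to be uint64-valued:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readUint64 reads one uint64-valued sample from runtime/metrics.
// ok is false when the running Go version does not export the metric
// or exports it with a different kind.
func readUint64(name string) (value uint64, ok bool) {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return samples[0].Value.Uint64(), true
}

func main() {
	for _, name := range []string{
		"/gc/gogc:percent",
		"/gc/gomemlimit:bytes",
		"/sched/gomaxprocs:threads",
	} {
		if v, ok := readUint64(name); ok {
			fmt.Printf("%s = %d\n", name, v)
		} else {
			fmt.Printf("%s is not available as uint64 in this Go version\n", name)
		}
	}
}
```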

3.4 Prometheus-Related

  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

Self-processed: the Prometheus HTTP handler tracks these itself.

4 Summary

If we want to include these metrics in our self-implemented metrics module, we would have to duplicate a lot of work that Prometheus has already done. Some of it seems to be Go-version-dependent, which means extra work and tests, and adds operational overhead.

If these metrics are not necessary, we should probably go with a self-implemented module. But if they are nice to have, a 2.2MB increase in the size of the Pebble binary is worth the price in the long run.

@IronCore864
Contributor Author

IronCore864 commented Nov 26, 2024

A PoC to add a new type of identity (basicauth) for the metrics endpoint.

1 Manually Add an Identity from a YAML File

$ cat identity.yaml
identities:
    bob:
        access: read
        basicauth:
            username: foo
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

2 Start Pebble

$ ./pebble run --http=:4000
2024-11-26T14:27:53.682Z [pebble] HTTP API server listening on ":4000".
2024-11-26T14:27:53.682Z [pebble] Started daemon.
2024-11-26T14:27:53.686Z [pebble] POST /v1/services 63.667µs 400
2024-11-26T14:27:53.686Z [pebble] Cannot start default services: no default services

3 Access the Metrics Endpoint with the Newly Created Identity

$ curl -u foo:bar localhost:4000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 4

4 Access without Identity or with an Invalid Username/Password

$ curl localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ curl -u invalid:invalid localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
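
The behaviour shown above (200 with valid credentials, 401 otherwise) can be sketched with the standard library alone. This is not the PoC's code: requireBasicAuth, newHandler, and statusFor are hypothetical helpers, the realm string is an assumption, and the plain-text comparison only mirrors the PoC stage before password hashing.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireBasicAuth wraps next and rejects requests whose HTTP basic
// credentials do not match. Comparing plain-text passwords is for
// illustration only; a real implementation would compare hashes.
func requireBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
		passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="pebble"`)
			http.Error(w, "access denied", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newHandler builds a protected metrics handler with the credentials
// used in the walkthrough (foo / bar).
func newHandler() http.Handler {
	metrics := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "my_counter 4")
	})
	return requireBasicAuth("foo", "bar", metrics)
}

// statusFor issues an in-memory GET /metrics and returns the status code.
// An empty user means the request carries no credentials at all.
func statusFor(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newHandler()
	fmt.Println(statusFor(h, "foo", "bar"))         // valid credentials
	fmt.Println(statusFor(h, "", ""))               // no credentials
	fmt.Println(statusFor(h, "invalid", "invalid")) // wrong credentials
}
```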

@IronCore864
Contributor Author

According to the last spec review, the following changes have been made:

  1. The identity type is renamed from "basicauth" to just "basic".
  2. The username field in the basic identity is removed -- we can just use the name of the identity instead.
  3. A new access type, "metrics", is added.
  4. Passwords are hashed using OpenSSL (TODO).

After the first round of refactoring, here are some results:

1 Basic Identity Name with Special Characters

$ cat identity.yaml
identities:
    "bob:asdf":
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
error: identity "bob:asdf" invalid: identity name "bob:asdf" contains invalid characters (only
       alphanumeric, underscore, and hyphen allowed)
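
Since the identity name now doubles as the basic-auth username, a colon in the name would be ambiguous, because HTTP basic auth uses the colon to separate the user ID from the password. A minimal sketch of such a check, assuming "alphanumeric" means ASCII letters and digits; the names identityNameRegexp and validateIdentityName are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// identityNameRegexp accepts ASCII letters, digits, underscore and hyphen.
// Restricting to ASCII is an assumption made for this sketch.
var identityNameRegexp = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// validateIdentityName returns an error for empty names and for names
// containing any other character, such as ':'.
func validateIdentityName(name string) error {
	if !identityNameRegexp.MatchString(name) {
		return fmt.Errorf("identity name %q contains invalid characters "+
			"(only alphanumeric, underscore, and hyphen allowed)", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"bob", "bob_2-x", "bob:asdf"} {
		if err := validateIdentityName(name); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Printf("%q is valid\n", name)
		}
	}
}
```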

2 Basic Identity without Username

$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

3 Basic Identity with Access Type "metrics"

$ # access type: read
$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # no access on the metrics endpoint
$ curl -u bob:bar localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: metrics
$ cat identity.yaml
identities:
    bob:
        access: metrics
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # accessing metrics
$ curl -u bob:bar localhost:4000/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 11
my_counter{operation=write,status=success} 22
my_counter{operation=read,status=failed} 11
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 28.12
$ # no access on other endpoints
$ curl -u bob:bar localhost:4000/v1/changes
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: admin
$ cat identity.yaml
identities:
    bob:
        access: admin
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # admin can read metrics
$ curl -u bob:bar localhost:4000/v1/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 176
my_counter{operation=write,status=success} 352
my_counter{operation=read,status=failed} 176
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 24.48

TODO: hashing password.
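
The access behaviour demonstrated in the session above can be summarised as a small decision function. This is a sketch, not the PR's implementation: the type and constant names are hypothetical, and two rules that the session does not exercise (read identities on read endpoints, admin-only endpoints) are assumptions.

```go
package main

import "fmt"

// Access is the access level of an identity.
type Access string

const (
	AccessAdmin   Access = "admin"
	AccessRead    Access = "read"
	AccessMetrics Access = "metrics"
)

// Endpoint is the access class an endpoint requires (hypothetical naming).
type Endpoint string

const (
	EndpointOpen    Endpoint = "open"    // e.g. /v1/health
	EndpointMetrics Endpoint = "metrics" // the metrics endpoint
	EndpointRead    Endpoint = "read"    // e.g. /v1/changes
	EndpointAdmin   Endpoint = "admin"
)

// allowed reports whether an identity may call an endpoint.
func allowed(identity Access, endpoint Endpoint) bool {
	switch endpoint {
	case EndpointOpen:
		return true
	case EndpointMetrics:
		// Shown above: metrics and admin succeed, read is denied.
		return identity == AccessAdmin || identity == AccessMetrics
	case EndpointRead:
		// Shown above: metrics is denied. Read being allowed is assumed.
		return identity == AccessAdmin || identity == AccessRead
	case EndpointAdmin:
		// Assumed: only admin identities qualify.
		return identity == AccessAdmin
	}
	return false
}

func main() {
	fmt.Println(allowed(AccessRead, EndpointMetrics))    // read identity, metrics endpoint
	fmt.Println(allowed(AccessMetrics, EndpointMetrics)) // metrics identity, metrics endpoint
	fmt.Println(allowed(AccessMetrics, EndpointRead))    // metrics identity, /v1/changes
	fmt.Println(allowed(AccessAdmin, EndpointMetrics))   // admin identity, metrics endpoint
}
```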

@IronCore864
Contributor Author

Notes:

We need to handle the memory usage issue in the future. To make this easier, we decided after discussion not to use a self-implemented Prometheus-like module that stores the metrics centrally, but rather to store the metrics on existing structs like serviceData and CheckInfo. In the last commit, the service-related metrics are implemented on serviceData. Check-related metrics are yet to be implemented.

@IronCore864
Contributor Author

IronCore864 commented Jan 24, 2025

Some notes before the holiday season:

Currently, the check metrics only work when the check is successful. When it fails, both counters are reset to 0 and never increase. More debugging is needed; it may have something to do with updateCheckInfo.

@IronCore864
Contributor Author

In the latest commits, m.checks now stores pointers so that it's easier to update the metrics. Plus, tests are added:

  • OpenTelemetryWriter: some basic tests are added.
  • Check metrics: Start the checks, assert that the counters are correct, and then stop the checks. This calls updateCheckInfo, which should not reset metrics.
  • Service metrics: Since the metrics are stored in serviceData, which isn't exported, testing them is difficult. As a workaround, I created a writer that writes to a buffer and then directly checks the OpenTelemetry-format result.
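
Why storing pointers in m.checks makes counter updates easier can be shown with a small, self-contained sketch. The CheckInfo below is a stripped-down stand-in with a hypothetical performCount field, not the real struct, and bumpValue and bumpPointer are illustrative helpers.

```go
package main

import "fmt"

// CheckInfo is a stand-in for the real struct; performCount is hypothetical.
type CheckInfo struct {
	Name         string
	performCount int64
}

// bumpValue updates a counter in a map of struct values. Go does not allow
// checks[name].performCount++ on such a map, so the entry has to be copied
// out, modified, and written back.
func bumpValue(checks map[string]CheckInfo, name string) {
	info, ok := checks[name]
	if !ok {
		return
	}
	info.performCount++
	checks[name] = info
}

// bumpPointer updates the counter in place through the stored pointer.
func bumpPointer(checks map[string]*CheckInfo, name string) {
	if info, ok := checks[name]; ok {
		info.performCount++
	}
}

func main() {
	values := map[string]CheckInfo{"bar": {Name: "bar"}}
	bumpValue(values, "bar")
	fmt.Println(values["bar"].performCount)

	pointers := map[string]*CheckInfo{"bar": {Name: "bar"}}
	bumpPointer(pointers, "bar")
	bumpPointer(pointers, "bar")
	fmt.Println(pointers["bar"].performCount)
}
```

With the pointer map, any code that already holds a *CheckInfo sees the updated counters without another map lookup or write-back.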

@IronCore864 IronCore864 requested a review from benhoyt February 12, 2025 12:27
@IronCore864 IronCore864 marked this pull request as ready for review February 12, 2025 12:27
@IronCore864 IronCore864 changed the title from "poc: a metrics module for pebble" to "feat: metrics for services and checks" Feb 12, 2025

@gruyaume gruyaume left a comment

I got curious about this change after listening to the mid-cycle roadmap presentation. Nobody asked for this review, so please feel free to ignore it.

go.mod (resolved)
Contributor

@benhoyt benhoyt left a comment

Mostly minor comments, though a few structural things. Main thing that needs discussion is whether the perform_check and recover_check counts are actually going to provide the monitoring we want -- let's discuss.

internals/daemon/api_metrics.go (outdated, resolved)
internals/daemon/api_metrics.go (outdated, resolved)
internals/metrics/metrics.go (outdated, resolved)
internals/metrics/metrics.go (outdated, resolved)
internals/metrics/metrics.go (outdated, resolved)
internals/overlord/servstate/handlers.go (outdated, resolved)
internals/overlord/servstate/manager_test.go (outdated, resolved)
internals/overlord/servstate/manager_test.go (outdated, resolved)
internals/overlord/servstate/manager.go (outdated, resolved)
internals/overlord/checkstate/manager.go (outdated, resolved)
@benhoyt
Contributor

benhoyt commented Feb 13, 2025

@IronCore864 Can you please update the PR description to match our new approach?

Also, it probably goes without saying, but let's be sure not to merge this before the underlying identities PR that this builds on (#563) is reviewed for security and merged.

internals/metrics/metrics.go (outdated, resolved)
internals/metrics/metrics.go (outdated, resolved)
internals/metrics/metrics.go (outdated, resolved)
internals/overlord/checkstate/manager.go (outdated, resolved)
@IronCore864
Contributor Author

Check metrics updated, tests also updated.

@IronCore864 IronCore864 requested a review from benhoyt February 18, 2025 04:30