Work around high cardinality metrics from otelcol receivers #4769

Merged · 27 commits into grafana:main on Sep 1, 2023

Conversation

@glindstedt (Contributor) commented Aug 9, 2023

PR Description

Heavily inspired by open-telemetry/opentelemetry-collector#7543, which was mentioned in the upstream issue. This PR adds views to the MeterProvider of the receiver component that filter out some high cardinality attributes.
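The diff itself isn't shown here, but a minimal sketch of the idea, using views from the OTel Go metric SDK, could look like the following. The function name and the exact attribute keys are illustrative assumptions; the keys dropped by the actual PR may differ.

```go
package views

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/metric"
)

// dropHighCardinalityAttrs returns views that strip socket-level peer/host
// attributes from every instrument, so series such as http_server_duration
// no longer fan out per client connection.
func dropHighCardinalityAttrs() []metric.View {
	return []metric.View{
		metric.NewView(
			metric.Instrument{Name: "*"}, // match all instruments
			metric.Stream{
				AttributeFilter: attribute.NewDenyKeysFilter(
					"net.sock.peer.addr",
					"net.sock.peer.port",
					"net.sock.host.addr",
					"net.sock.host.port",
				),
			},
		),
	}
}
```

Such views would then be applied when the receiver builds its meter provider, for example via metric.NewMeterProvider(metric.WithView(dropHighCardinalityAttrs()...)).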

Which issue(s) this PR fixes

Fixes #4764

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated (N/A)

@glindstedt marked this pull request as ready for review August 9, 2023 14:30
@glindstedt requested a review from a team as a code owner August 9, 2023 14:30
@glindstedt (Contributor, Author):

I suspect the test failure was a flake? It succeeded on the commit before the changelog update.

@mattdurham (Collaborator):

Likely, the OSX and Win tests are unreliable at times. I restarted the tests.

@ptodev (Contributor) commented Aug 14, 2023

Thank you for your contribution! I agree that we should do something about this problem. I'd have preferred it if the Collector code avoided high cardinality metrics by default, rather than requiring them to be explicitly disabled via a feature gate. In any case, I believe we should be able to toggle this on and off in the Grafana Agent.

Would you be willing to try to implement a boolean flag called OtelcolDisableHighCardinalityMetrics inside a new metrics block, similar to the logging and tracing blocks that we currently have? If you think this is too much work, I could try to do it myself and we can combine our work later. But I personally consider such a switch necessary for merging this change, so that we don't affect users who already depend on this functionality.

By the way, we will also need to document those metrics. I raised a separate issue for this - #4769 - and we could work on this in a separate PR.

@glindstedt (Contributor, Author):

@ptodev Ok! Yeah I think I can take a stab at implementing that 👍

@glindstedt (Contributor, Author) commented Aug 15, 2023

@ptodev I've been working on a PoC but ran into an issue. I now have a metrics config block that creates a metric.MeterProvider, which gets passed down to the otelcol component. The problem is that running Update on only the metrics component and replacing the MeterProvider with a differently configured one has no effect on the downstream component it was originally passed to. Is there a way to declare that one component depends on another, so that an Update on the "parent" triggers updates on all dependent components? Or, failing that, is there a way to force re-evaluation of all components when a core config block like this changes?

@ptodev (Contributor) commented Aug 16, 2023

There is no need for the metrics block to own the metric.MeterProvider. It's OK to just add a new member to the Options struct in component/registry.go. That new member could be an interface with a function like MeterOptions() which returns a []metric.Option. Then, in component/otelcol/receiver/receiver.go, we can call that function and pass the metric options down to metric.NewMeterProvider().
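A rough sketch of that shape could look like the following. All of the names here are assumptions for illustration, not necessarily the code that was ultimately merged.

```go
package receiver

import (
	"github.com/prometheus/client_golang/prometheus"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/sdk/metric"
)

// MeterOptionsProvider is a hypothetical new member of component.Options:
// something the host passes down so that components which build their own
// MeterProvider can pick up extra options (e.g. metric.WithView filters).
type MeterOptionsProvider interface {
	MeterOptions() []metric.Option
}

// newMeterProvider shows roughly how component/otelcol/receiver/receiver.go
// could combine its Prometheus reader with whatever options were passed down.
func newMeterProvider(reg prometheus.Registerer, p MeterOptionsProvider) (*metric.MeterProvider, error) {
	exporter, err := otelprom.New(otelprom.WithRegisterer(reg))
	if err != nil {
		return nil, err
	}
	opts := append([]metric.Option{metric.WithReader(exporter)}, p.MeterOptions()...)
	return metric.NewMeterProvider(opts...), nil
}
```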

@glindstedt (Contributor, Author):

Even so, if we just pass down the options the problem will be the same: the components that use those options when building their MeterProviders will still need Update called on them for the config change to have any effect.

@ptodev (Contributor) commented Aug 17, 2023

I'm 95% sure that it won't be a problem; Flow is designed in such a way that if a config update is required, the Update() function needs to be called. You could test it this way:

  1. Start the agent with this flag switched to some value.
  2. Hit the /metrics endpoint on the agent to see if you have high cardinality metrics.
  3. Update the config file so the flag has the opposite value.
  4. Hit the reload endpoint. Note that this documentation is for static mode, but I think this endpoint works for flow too.
  5. Hit the /metrics endpoint again.

@glindstedt (Contributor, Author) commented Aug 17, 2023

That's what I've been trying but invariably I have to force a rebuild of the other block by changing some variable. Right now I'm testing with this config:

logging {
	level  = "debug"
	format = "logfmt"
}

metrics {
    otelcol_disable_high_cardinality_metrics = false
}

otelcol.receiver.zipkin "default" {
    parse_string_tags = false
    output {
        traces = []
    }
}

If I change the metrics bool and curl /-/reload, that block gets rebuilt and the metrics disappear from /metrics (because I create a new Registry; otherwise there would be clashes, since the label set is different). They remain broken (the MeterProvider the receiver holds a reference to is still writing to the old registry) until I flip parse_string_tags on the receiver and curl /-/reload again to force a rebuild of that component. Only then does it start working again with the intended behavior.

I think if I refactored to pass down the option, and still had the individual components own their own registries, it would still be "working" after the metrics component is rebuilt; however, it would keep the old behavior until the receiver component itself is rebuilt, which doesn't seem to happen on its own.

@rfratto (Member) commented Aug 22, 2023

@ptodev The suggestion to have a new top-level block with a setting that only impacts a subset of components is weird to me; that's typically not how we've written things for Flow in the past.

What do you think about adding this attribute to a block in the component instead? This would allow per-component tuning of metrics, and would be more Flow native in that things would be logically separated.

@ptodev (Contributor) commented Aug 24, 2023

@rfratto I'm used to thinking of these feature gates as global settings, since in theory any Otel component could use them. However, in this case we are not setting an Otel feature gate in the Agent, so I agree that we could have per-component overrides.

The main issue with per-component overrides is that we are not sure which components exactly will be impacted, and how to document them. What do you think about this course of action:

  • Create a new metrics block which is shared between receiver & exporter otel components. This will be similar to how the output block is shared.
  • Add an option disable_high_cardinality_metrics to the metrics block.
  • Add this new metrics block to the code and docs of all exporter and receiver components.

The downside of the above is that even if a component doesn't end up producing high cardinality metrics which match this filter, it'll still have that setting. But I think that's ok.

@rfratto (Member) commented Aug 24, 2023

I'm fine with that proposal, but I would suggest not calling the block metrics since that might be confused with the signals emitted from the component.

Something like debug or debug_metrics would be slightly more descriptive and less ambiguous here.

@ptodev (Contributor) commented Aug 24, 2023

@rfratto - thank you, debug_metrics sounds good to me!

@glindstedt - would you like to try implementing the proposal from this comment please? Thank you for working with us, and for your patience! This change would be very useful. If you are too busy, don't worry, we could also pick it up at some point.

@glindstedt (Contributor, Author):

Thank you both for sorting out a path forward. @ptodev yes I think I'll have some time to implement this in the coming week, but if other priorities come in between I'll let you know!

@glindstedt (Contributor, Author):

I've pushed a commit that adds the debug_metrics block now, but I have a question. Right now the block is not added to the prometheus and loki exporters/receivers; they don't implement the receiver.Arguments/exporter.Arguments interfaces like the other components do, so I'm wondering whether they are special cases that should be treated as such, or whether I should go ahead and add the block anyway.

@ptodev (Contributor) left a review:

Thank you for the update!

It'd be nice to add a few tests, at least for some components, to make sure that the field is set as expected.

We would also need documentation, similar to how the output block is currently documented:

  • There is a shared md file for it.
  • It is listed in the docs for each component two times:
    • In the table of configuration blocks
    • As a paragraph, which imports the shared md file.

In the shared md, we could also mention that this component may not end up using those views if it doesn't generate such metrics. For example, a shared md could look like this:

---
aliases:
- ../../otelcol/debug-metrics-block/
canonical: https://grafana.com/docs/agent/latest/shared/flow/reference/components/debug-metrics-block/
headless: true
---

The `debug_metrics` block configures the metrics which this component generates to monitor its state.

The following arguments are supported:

Name | Type | Description | Default | Required
---- | ---- | ----------- | ------- | --------
`disable_high_cardinality_metrics` | `bool` | Whether to disable certain high cardinality metrics. | `false` | no

`disable_high_cardinality_metrics` is the Grafana Agent equivalent to the `telemetry.disableHighCardinalityMetrics` feature gate in the OpenTelemetry Collector. It removes attributes that could cause high cardinality metrics.
For example, attributes with IP addresses and port numbers in metrics about HTTP and gRPC connections will be removed.
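For reference, a component would then opt in through its river configuration, roughly like this (a sketch based on the attribute documented above, reusing the zipkin receiver from earlier in the thread):

```river
otelcol.receiver.zipkin "default" {
    debug_metrics {
        disable_high_cardinality_metrics = true
    }

    output {
        traces = []
    }
}
```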

(Inline review comments on component/otelcol/internal/views/views.go and component/otelcol/exporter/loadbalancing/loadbalancing.go were marked as outdated and resolved.)
@ptodev (Contributor) commented Aug 29, 2023

> I've pushed a commit that adds the debug_metrics block now, but I have a question. Right now the block is not added to the prometheus and loki exporters/receivers; they don't implement the receiver.Arguments/exporter.Arguments interfaces like the other components do, so I'm wondering whether they are special cases that should be treated as such, or whether I should go ahead and add the block anyway.

This is correct - those are Agent-specific components which are not in the Collector, and they are not in scope for this change.

@clayton-cornell (Contributor) left a review:

Some spelling fixes. Switch "which" for "that".

@ptodev (Contributor) left a review:

Looks good! Thank you very much for working on this! I just left one minor comment that needs to be addressed prior to merging.

Also, please feel free to update the Changelog with an entry such as this under the "Enhancements" section:

- Added a `disable_high_cardinality_metrics` configuration flag to `otelcol` 
  exporters and receivers to switch high cardinality debug metrics off.  (@glindstedt)

That way the change will be visible in the release notes for the next version, and we can pay tribute to you as well :)


// DebugMetricsArguments configures internal metrics of the components
type DebugMetricsArguments struct {
	DisableHighCardinalityMetrics bool `river:"disable_high_cardinality_metrics,attr,optional"`
}

A reviewer (Contributor) commented on this excerpt:

TBH I'm not sure if we should just call this high_cardinality_metrics. IMO we can merge this as it is for now, and I'll have a discussion with the other people on the team if they'd be in favour of changing it prior to the next release.

glindstedt and others added 13 commits September 1, 2023 15:21
@glindstedt (Contributor, Author):

Thank you for the reviews! Suggestions committed and changelog updated 👍

@ptodev merged commit 18c1cd7 into grafana:main on Sep 1, 2023 (6 checks passed).
@glindstedt deleted the otel-metric-cardinality-workaround branch on September 3, 2023.
The github-actions bot added the frozen-due-to-age label and locked the conversation as resolved on Feb 22, 2024.
Labels: frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Linked issue: High cardinality http_server_* metrics from otelcol.receiver.zipkin (#4764)

5 participants