Add failureThreshold to elastic-agent self-monitoring config #5999

pchila · 2024-11-12T13:47:11Z

What does this PR do?

Use the failure_threshold setting introduced in elastic/beats#41570 in the self-monitoring configuration, so that elastic-agent does not report DEGRADED when a metrics fetch fails because a component is starting or stopping.
The default value for the failure threshold is set to 2, but it can be configured via the config file or the fleet policy.

Why is it important?

It is important to avoid a misrepresentation of agent status due to a single metrics fetch erroring out once.
See #5332

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
~~[ ] I have added an integration test or an E2E test~~

elasticmachine · 2024-11-12T13:47:14Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine · 2024-11-12T13:47:14Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify · 2024-11-12T13:47:46Z

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2024-11-12T13:47:46Z

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

blakerouse

Looks good to me.

michalpristas · 2024-11-13T13:09:44Z

internal/pkg/core/monitoring/config/config.go

+	MonitorLogs      bool                  `yaml:"logs" config:"logs"`
+	MonitorMetrics   bool                  `yaml:"metrics" config:"metrics"`
+	MetricsPeriod    string                `yaml:"metrics_period" config:"metrics_period"`
+	FailureThreshold *uint                 `yaml:"failure_threshold" config:"failure_threshold"`


Based on this configuration, the default value for FailureThreshold is nil, but that is not true, since we set a default value if it's not set.
Shouldn't we preset it here in DefaultConfig() to make expectations clear?
I know you took the metrics period as an example. Not a blocker, just thinking out loud.

Also, why *uint? It seems unconventional, and it does not match the beats-side implementation either.

pchila · 2024-11-13

The reason why the defaults here don't match the beats default is backward compatibility: until now, the first error on a stream in beats would trigger a status change to DEGRADED, and beats maintains that behavior by default.

Here on the agent side, the reason for *uint comes from:

- the zero value (0) would disable any status change for the monitoring metricbeat streams
- the default is managed in internal/pkg/agent/application/monitoring/v1_monitor.go, when we are sure that no value has been set in the config (pointer == nil)

I suppose I could set the default value in func DefaultConfig() *MonitoringConfig, but if by any chance the monitoring config is deserialized without calling DefaultConfig() we would get an erroneous value set in the failureThreshold (0).
Another case I am concerned with is unmarshaling from a fleet policy that does not specify any value for the threshold: if we set 2 or 0 by default there, we could be overriding some other value...
Overall, having a clear "not set" value that matches the Go zero value seemed a safer approach.
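
The nil-versus-zero distinction described in this comment can be sketched as follows. This is an illustration under stated assumptions: it uses `encoding/json` as a stand-in for the agent's real config loader, and `monitoringConfig`, `resolveThreshold`, `mustParse`, and `defaultFailureThreshold` are hypothetical names, not the code in v1_monitor.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultFailureThreshold is the default of 2 stated in the PR description.
const defaultFailureThreshold uint = 2

// monitoringConfig mirrors only the field under discussion.
// A nil pointer means "not set"; a pointer to 0 means "explicitly disabled".
type monitoringConfig struct {
	FailureThreshold *uint `json:"failure_threshold"`
}

// resolveThreshold applies the default only when no value was provided.
func resolveThreshold(cfg monitoringConfig) uint {
	if cfg.FailureThreshold == nil {
		return defaultFailureThreshold
	}
	return *cfg.FailureThreshold
}

// mustParse decodes a raw JSON document without pre-filling any defaults,
// which is the situation the comment above is worried about.
func mustParse(raw string) monitoringConfig {
	var cfg monitoringConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		panic(err)
	}
	return cfg
}

func main() {
	inputs := []string{
		`{}`,                       // key absent: default applies
		`{"failure_threshold": 0}`, // explicit 0: kept as is
		`{"failure_threshold": 5}`, // explicit value: kept as is
	}
	for _, raw := range inputs {
		fmt.Printf("%s -> %d\n", raw, resolveThreshold(mustParse(raw)))
	}
}
```

With a plain `uint` field, the first two inputs would both decode to 0 and could not be told apart, which is the case the pointer is meant to avoid.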

michalpristas · 2024-11-14

Can you elaborate on backward compatibility? In a config/agent matrix:

- old/old and new/new: we don't care
- old/new: we have an old config and should aim for the safer behavior
- new/old: a new config with an old agent; we don't understand the newly introduced keywords and ignore them, so the behavior is not there

> I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erroneous value set in the failureThreshold (0).

This probably applies to any config we have.

pchila · 2024-11-14

> can you elaborate about backward compatibility?
> in a matrix of config/agent
> old/old and new/new we don't care
> old/new - means we have old config and we should aim for safer behavior
> new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there

When I mentioned backward compatibility, I was referring to metricbeat and to the reason why the defaults in agent don't match the ones in beats, as raised in an earlier comment (quoted below):

> also why *uint seems unconventional and it does not match beat side implementation either.

What I meant by "backward compatibility" was not "backward compatibility between agent and beats versions" but rather "the metricbeat default value for failureThreshold takes into account the current behaviour when a metricset stream errors out during fetch".

There's no need for a config/agent old/new matrix, as we want to change the default behaviour of the monitoring config in order to solve #5332.
The possibility to configure the value in the monitoring config has been added as a way to change or disable the DEGRADED status if we need to mitigate an issue or change the behavior for testing purposes (it's an escape hatch of sorts).

> I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erroneous value set in the failureThreshold (0).
>
> this applies for probably any config we have

In this case, all values of uint are valid configuration values (there's no "not set" value).
The fact that other configuration values already cannot represent "not explicitly set" because of their defaults should not prevent us from defining a new config value as *uint, so that a nil value is interpreted as "value not set".
I am not sure that following what has already been done for other configurations would have any definite advantage in this case.

elastic-sonarqube · 2024-11-19T14:46:25Z

Quality Gate passed

Issues
2 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
80.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

* Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509)

…6090) * Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509) Co-authored-by: Paolo Chilà <[email protected]>

cmacknz · 2024-11-20T21:37:05Z

We should fix this in 8.15 and 8.16 as well since it is a significant source of test flakiness and a bug fix.

* Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509) # Conflicts: # internal/pkg/agent/application/monitoring/v1_monitor.go

…6105) * Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509) Co-authored-by: Paolo Chilà <[email protected]>

pchila added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team skip-changelog labels Nov 12, 2024

pchila self-assigned this Nov 12, 2024

pchila requested a review from a team as a code owner November 12, 2024 13:47

pchila requested review from michalpristas and michel-laterman November 12, 2024 13:47

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 12, 2024

pierrehilbert requested review from blakerouse and removed request for michel-laterman November 12, 2024 14:39

blakerouse approved these changes Nov 12, 2024

michalpristas approved these changes Nov 13, 2024

pchila added 4 commits November 19, 2024 15:16

Add failureThreshold to elastic-agent self-monitoring config

7c58844

lint

988d47b

fix unit tests

de9470e

revert streams and beatStreams to []any in injectMetricsInput()

41471b3

pchila force-pushed the add-failure-thresholds-to-monitoring-config branch from b8e4557 to 41471b3 Compare November 19, 2024 14:16

pchila merged commit 2a46509 into elastic:main Nov 20, 2024
14 checks passed

mergify bot pushed a commit that referenced this pull request Nov 20, 2024

Add failureThreshold to elastic-agent self-monitoring config (#5999)

e70deef

* Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509)

mergify bot mentioned this pull request Nov 20, 2024

[8.x](backport #5999) Add failureThreshold to elastic-agent self-monitoring config #6090

Merged

3 tasks

cmacknz added the backport-8.16 Automated backport with mergify label Nov 20, 2024

mergify bot pushed a commit that referenced this pull request Nov 20, 2024

Add failureThreshold to elastic-agent self-monitoring config (#5999)

6f2fa1d

* Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509)

cmacknz added the backport-8.15 Automated backport to the 8.15 branch with mergify label Nov 20, 2024

mergify bot mentioned this pull request Nov 20, 2024

[8.16](backport #5999) Add failureThreshold to elastic-agent self-monitoring config #6105

Merged

3 tasks

mergify bot mentioned this pull request Nov 20, 2024

[8.15](backport #5999) Add failureThreshold to elastic-agent self-monitoring config #6106

Closed

3 tasks

pchila added a commit that referenced this pull request Nov 26, 2024

Add failureThreshold to elastic-agent self-monitoring config (#5999)

a567a90

* Add failureThreshold to elastic-agent self-monitoring config (cherry picked from commit 2a46509)

pchila mentioned this pull request Nov 27, 2024

Increase failure threshold for agent monitoring inputs from 2 to 5 #6160

Merged

1 task

Alphayeeeet mentioned this pull request Feb 11, 2025

Introduced failureThreshold causes agents to become unhealthy elastic/beats#42672

Open
