Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failureThreshold to elastic-agent self-monitoring config #5999

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pchila
Copy link
Member

@pchila pchila commented Nov 12, 2024

What does this PR do?

Use failure_threshold introduced in elastic/beats#41570 in self-monitoring configuration to avoid elastic-agent reporting DEGRADED if it fails to fetch metrics due to a component starting/stopping.
The default value for the failure threshold is set to 2 but it can be configured via config file or fleet policy.

Why is it important?

It is important to avoid a misrepresentation of agent status due to a single metrics fetch erroring out once.
See #5332

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team skip-changelog labels Nov 12, 2024
@pchila pchila self-assigned this Nov 12, 2024
@pchila pchila requested a review from a team as a code owner November 12, 2024 13:47
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Copy link
Contributor

mergify bot commented Nov 12, 2024

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Nov 12, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 12, 2024
@pierrehilbert pierrehilbert requested review from blakerouse and removed request for michel-laterman November 12, 2024 14:39
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

MonitorLogs bool `yaml:"logs" config:"logs"`
MonitorMetrics bool `yaml:"metrics" config:"metrics"`
MetricsPeriod string `yaml:"metrics_period" config:"metrics_period"`
FailureThreshold *uint `yaml:"failure_threshold" config:"failure_threshold"`
Copy link
Contributor

@michalpristas michalpristas Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on this configuration default value for FailureThreshold is nil
but it is not true as we set default value if it's not set.
shouldn't we preset it here in DefaultConfig() to make expectations clear?
i know you took metrics period as an example. not a blocker just thinking out loud

Copy link
Contributor

@michalpristas michalpristas Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also why *uint seems unconventional and it does not match beat side implementation either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why the defaults here don't match the beats default is for backward compatibility: until now the first error on a stream in beats would trigger a status change to DEGRADED and beats maintains that behavior by default.

Here on the agent side the reason for *uint comes from:

  • zero value (0) would disable any status change for the monitoring metricbeat streams
  • the default is managed in internal/pkg/agent/application/monitoring/v1_monitor.go when we are sure there's been no set of value in config (pointer == nil)

I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).
Another case I am concerned with is unmarshaling from a fleet policy that does not specify any value for the threshold: if we set 2 or 0 by default there we could be overriding some other value...
Overall having a clear "not set" value that matches the go zero value seemed a safer approach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate about backward compatibility?
in a matrix of config/agent
old/old and new/new we don't care
old/new - means we have old config and we should aim for safer behavior
new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there

I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).

this applies for probably any config we have

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate about backward compatibility? in a matrix of config/agent old/old and new/new we don't care old/new - means we have old config and we should aim for safer behavior new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there

When I mentioned backward compatibility I was referring to metricbeat and the reason why the defaults in agent don't match the one in beats as it was mentioned in an earlier comment (quoted below)

also why *uint seems unconventional and it does not match beat side implementation either.

What I meant with "backward compatibility" was not "backward compatibility between agent and beats version" but rather "metricbeat default value for failureThreshold takes into account the current behaviour when a metricset stream errors out during fetch"

There's no need for a config/agent old/new matrix as we want to change the default behaviour of monitoring config in order to solve #5332.
The possibility to configure the value in the monitoring config has been added as a way to change/disable the DEGRADED status if we need to mitigate an issue or change behavior testing purposes (it's an escape hatch of sorts).

I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).

this applies for probably any config we have

In this case, all values of uint are valid configuration values (there's no not set value).
The fact that we have already other configuration values that cannot represent "not explicitly set" because of default values, should not preclude to define a new config value as *uint so that a nil value to be interpreted as "value not set".
I am not sure that following what has already been done for other configurations would have definite advantages in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify bug Something isn't working skip-changelog Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agent gets unhealthy temporarily because Beat monitoring sockets are not available
4 participants