-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add failureThreshold to elastic-agent self-monitoring config #5999
base: main
Are you sure you want to change the base?
Add failureThreshold to elastic-agent self-monitoring config #5999
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
This pull request does not have a backport label. Could you fix it @pchila? 🙏
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Quality Gate passedIssues Measures |
MonitorLogs bool `yaml:"logs" config:"logs"` | ||
MonitorMetrics bool `yaml:"metrics" config:"metrics"` | ||
MetricsPeriod string `yaml:"metrics_period" config:"metrics_period"` | ||
FailureThreshold *uint `yaml:"failure_threshold" config:"failure_threshold"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on this configuration default value for FailureThreshold is nil
but it is not true as we set default value if it's not set.
shouldn't we preset it here in DefaultConfig()
to make expectations clear?
i know you took metrics period as an example. not a blocker just thinking out loud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also why *uint
seems unconventional and it does not match beat side implementation either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the defaults here don't match the beats default is for backward compatibility: until now the first error on a stream in beats would trigger a status change to DEGRADED and beats maintains that behavior by default.
Here on the agent side the reason for *uint comes from:
- zero value (0) would disable any status change for the monitoring metricbeat streams
- the default is managed in
internal/pkg/agent/application/monitoring/v1_monitor.go
when we are sure there's been no set of value in config (pointer == nil)
I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).
Another case I am concerned with is unmarshaling from a fleet policy that does not specify any value for the threshold: if we set 2
or 0
by default there we could be overriding some other value...
Overall having a clear "not set" value that matches the go zero value seemed a safer approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate about backward compatibility?
in a matrix of config/agent
old/old and new/new we don't care
old/new - means we have old config and we should aim for safer behavior
new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there
I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).
this applies for probably any config we have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate about backward compatibility? in a matrix of config/agent old/old and new/new we don't care old/new - means we have old config and we should aim for safer behavior new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there
When I mentioned backward compatibility I was referring to metricbeat and the reason why the defaults in agent don't match the one in beats as it was mentioned in an earlier comment (quoted below)
also why *uint seems unconventional and it does not match beat side implementation either.
What I meant with "backward compatibility" was not "backward compatibility between agent and beats version" but rather "metricbeat default value for failureThreshold takes into account the current behaviour when a metricset stream errors out during fetch"
There's no need for a config/agent old/new matrix as we want to change the default behaviour of monitoring config in order to solve #5332.
The possibility to configure the value in the monitoring config has been added as a way to change/disable the DEGRADED status if we need to mitigate an issue or change behavior testing purposes (it's an escape hatch of sorts).
I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).
this applies for probably any config we have
In this case, all values of uint
are valid configuration values (there's no not set
value).
The fact that we have already other configuration values that cannot represent "not explicitly set" because of default values, should not preclude to define a new config value as *uint
so that a nil
value to be interpreted as "value not set".
I am not sure that following what has already been done for other configurations would have definite advantages in this case.
What does this PR do?
Use
failure_threshold
introduced in elastic/beats#41570 in self-monitoring configuration to avoid elastic-agent reporting DEGRADED if it fails to fetch metrics due to a component starting/stopping.The default value for the failure threshold is set to 2 but it can be configured via config file or fleet policy.
Why is it important?
It is important to avoid a misrepresentation of agent status due to a single metrics fetch erroring out once.
See #5332
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added an entry in./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testDisruptive User Impact
How to test this PR locally
Related issues
Questions to ask yourself