-
Notifications
You must be signed in to change notification settings - Fork 189
Add failureThreshold to elastic-agent self-monitoring config #5999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
pchila
merged 4 commits into
elastic:main
from
pchila:add-failure-thresholds-to-monitoring-config
Nov 20, 2024
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on this configuration default value for FailureThreshold is
nil
but it is not true as we set default value if it's not set.
shouldn't we preset it here in
DefaultConfig()
to make expectations clear?i know you took metrics period as an example. not a blocker just thinking out loud
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also why
*uint
seems unconventional and it does not match beat side implementation either.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the defaults here don't match the beats default is for backward compatibility: until now the first error on a stream in beats would trigger a status change to DEGRADED and beats maintains that behavior by default.
Here on the agent side the reason for *uint comes from:
internal/pkg/agent/application/monitoring/v1_monitor.go
when we are sure there's been no set of value in config (pointer == nil)I suppose I could set the default value in func DefaultConfig() *MonitoringConfig but if by any chance the Monitoring config is deserialized without calling DefaultConfig() we would get an erraneous value set in the failureThreshold (0).
Another case I am concerned with is unmarshaling from a fleet policy that does not specify any value for the threshold: if we set
2
or0
by default there we could be overriding some other value...Overall having a clear "not set" value that matches the go zero value seemed a safer approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate about backward compatibility?
in a matrix of config/agent
old/old and new/new we don't care
old/new - means we have old config and we should aim for safer behavior
new/old - new config with old agent we don't understand newly introduced keywords and ignore them, behavior is not there
this applies for probably any config we have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I mentioned backward compatibility I was referring to metricbeat and the reason why the defaults in agent don't match the one in beats as it was mentioned in an earlier comment (quoted below)
What I meant with "backward compatibility" was not "backward compatibility between agent and beats version" but rather "metricbeat default value for failureThreshold takes into account the current behaviour when a metricset stream errors out during fetch"
There's no need for a config/agent old/new matrix as we want to change the default behaviour of monitoring config in order to solve #5332.
The possibility to configure the value in the monitoring config has been added as a way to change/disable the DEGRADED status if we need to mitigate an issue or change behavior testing purposes (it's an escape hatch of sorts).
In this case, all values of
uint
are valid configuration values (there's nonot set
value).The fact that we have already other configuration values that cannot represent "not explicitly set" because of default values, should not preclude to define a new config value as
*uint
so that anil
value to be interpreted as "value not set".I am not sure that following what has already been done for other configurations would have definite advantages in this case.