Commit
Flow worker utilization probe (#16532)
* flow: refactor pipeline refs to keep worker flows separate
* health: add worker_utilization probe

  pipeline is:
  - RED "completely blocked" when last_5_minutes >= 99.999
  - YELLOW "nearly blocked" when last_5_minutes > 95
    - and includes "recovering" info when last_1_minute < 80
  - YELLOW "completely blocked" when last_1_minute >= 99.999
  - YELLOW "nearly blocked" when last_1_minute > 95
* tests: improve coverage of PipelineIndicator probes
* Apply suggestions from code review
Showing 7 changed files with 441 additions and 26 deletions.
44 changes: 44 additions & 0 deletions
docs/static/troubleshoot/health-pipeline-flow-worker-utilization.asciidoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
[[health-report-pipeline-flow-worker-utilization]] | ||
=== Health Report Pipeline Flow: Worker Utilization | ||
|
||
The Pipeline indicator has a `flow:worker_utilization` probe that can produce one of several diagnoses about blockages in the pipeline. | ||
|
||
A pipeline is considered "blocked" when its workers are fully utilized: if they consistently spend 100% of their time processing events, they cannot pick up new events from the queue. | ||
This can cause back-pressure to cascade to upstream services, which can result in data loss or duplicate processing depending on upstream configuration. | ||
|
||
The issue typically stems from one or more causes: | ||
|
||
* a downstream resource being blocked, | ||
* a plugin consuming more resources than expected, and/or | ||
* insufficient resources being allocated to the pipeline. | ||
|
||
To address the issue, observe the <<plugin-flow-rates>> from the <<node-stats-api>>, and identify which plugins have the highest `worker_utilization`. | ||
This tells you which plugins are consuming the largest share of the pipeline's worker resources. | ||
|
||
* If the offending plugin connects to a downstream service or another pipeline that is exerting back-pressure, the issue needs to be addressed in the downstream service or pipeline. | ||
* If the offending plugin connects to a downstream service with high network latency, throughput for the pipeline may be improved by <<tuning-logstash-settings, allocating more worker resources to the pipeline>>. | ||
* If the offending plugin is a computation-heavy filter such as `grok` or `kv`, its configuration may need to be tuned to eliminate wasted computation. | ||
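
The ranking step described above can be sketched in Python. This is a minimal illustration, not part of the commit: the function name `rank_plugins_by_worker_utilization` is hypothetical, and the response layout (`pipelines.<id>.plugins.filters[]` and `outputs[]`, each with `flow.worker_utilization.<window>`) is an assumption about the node stats API that should be checked against your Logstash version. The sample payload is made up for demonstration.

```python
from typing import Any, Dict, List, Tuple

# (plugin type, plugin name, plugin id, worker_utilization value)
Ranked = Tuple[str, str, str, float]


def rank_plugins_by_worker_utilization(
    node_stats: Dict[str, Any],
    pipeline_id: str,
    window: str = "last_1_minute",
) -> List[Ranked]:
    """Return filter and output plugins sorted by worker_utilization, highest first.

    ASSUMPTION: node_stats is the parsed JSON body of the node stats API and
    uses the layout described in the lead-in. Plugins without a numeric value
    for the requested window are skipped.
    """
    plugins = node_stats.get("pipelines", {}).get(pipeline_id, {}).get("plugins", {})
    ranked: List[Ranked] = []
    for plugin_type in ("filters", "outputs"):
        for plugin in plugins.get(plugin_type, []):
            value = plugin.get("flow", {}).get("worker_utilization", {}).get(window)
            # Skip missing or non-numeric values (bool is excluded explicitly).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                ranked.append(
                    (plugin_type, plugin.get("name", "?"), plugin.get("id", "?"), float(value))
                )
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked


# Made-up sample payload, for illustration only.
sample = {
    "pipelines": {
        "main": {
            "plugins": {
                "inputs": [
                    {"id": "in-1", "name": "beats",
                     "flow": {"throughput": {"last_1_minute": 120.0}}},
                ],
                "filters": [
                    {"id": "grok-1", "name": "grok",
                     "flow": {"worker_utilization": {"last_1_minute": 41.5}}},
                    {"id": "mutate-1", "name": "mutate"},  # no flow data yet
                ],
                "outputs": [
                    {"id": "es-1", "name": "elasticsearch",
                     "flow": {"worker_utilization": {"last_1_minute": 55.2}}},
                ],
            }
        }
    }
}

ranked = rank_plugins_by_worker_utilization(sample, "main")
for plugin_type, name, plugin_id, value in ranked:
    print(f"{value:6.1f}  {plugin_type:<8} {name} ({plugin_id})")
```

To run this against a live node, the parsed JSON would typically come from the node stats endpoint (commonly `http://localhost:9600/_node/stats/pipelines`; host and port depend on your API settings). The plugin at the top of the list is the first candidate for the troubleshooting steps above.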
|
||
[[health-report-pipeline-flow-worker-utilization-diagnosis-blocked-5m]] | ||
==== [[blocked-5m]]Blocked Pipeline (5 minutes) | ||
|
||
A pipeline that has been completely blocked for five minutes or more represents a critical blockage to the flow of events through your pipeline that needs to be addressed immediately to avoid or limit data loss. | ||
See above for troubleshooting steps. | ||
|
||
[[health-report-pipeline-flow-worker-utilization-diagnosis-nearly-blocked-5m]] | ||
==== [[nearly-blocked-5m]]Nearly Blocked Pipeline (5 minutes) | ||
|
||
A pipeline that has been nearly blocked for five minutes or more may be intermittently blocking the flow of events through your pipeline, which creates a risk of data loss. | ||
See above for troubleshooting steps. | ||
|
||
[[health-report-pipeline-flow-worker-utilization-diagnosis-blocked-1m]] | ||
==== [[blocked-1m]]Blocked Pipeline (1 minute) | ||
|
||
A pipeline that has been completely blocked for one minute or more represents a high-risk or upcoming blockage to the flow of events through your pipeline that likely needs to be addressed soon to avoid or limit data loss. | ||
See above for troubleshooting steps. | ||
|
||
[[health-report-pipeline-flow-worker-utilization-diagnosis-nearly-blocked-1m]] | ||
==== [[nearly-blocked-1m]]Nearly Blocked Pipeline (1 minute) | ||
|
||
A pipeline that has been nearly blocked for one minute or more may be intermittently blocking the flow of events through your pipeline, which creates a risk of data loss. | ||
See above for troubleshooting steps. |
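
The four diagnoses above correspond to the thresholds listed in the commit message. The following Python sketch shows one way those thresholds could map to a diagnosis; it is not the probe's actual implementation. The function name, the return shape, the "green" fallback, the handling of missing values, and the evaluation order (five-minute checks before one-minute checks, first match wins) are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# (status, diagnosis anchor or None, recovering flag)
Diagnosis = Tuple[str, Optional[str], bool]


def diagnose_worker_utilization(
    last_1_minute: Optional[float],
    last_5_minutes: Optional[float],
) -> Diagnosis:
    """Map worker_utilization percentages to a status and diagnosis.

    Thresholds come from the commit message. ASSUMPTIONS: checks run in the
    order listed there, a missing window (None) is skipped, and a pipeline
    that matches nothing is reported as "green".
    """
    if last_5_minutes is not None:
        if last_5_minutes >= 99.999:
            return ("red", "blocked-5m", False)
        if last_5_minutes > 95:
            # "recovering" info is attached when the last minute has eased off.
            recovering = last_1_minute is not None and last_1_minute < 80
            return ("yellow", "nearly-blocked-5m", recovering)
    if last_1_minute is not None:
        if last_1_minute >= 99.999:
            return ("yellow", "blocked-1m", False)
        if last_1_minute > 95:
            return ("yellow", "nearly-blocked-1m", False)
    return ("green", None, False)


# Example: sustained saturation over five minutes, but the last minute has eased.
print(diagnose_worker_utilization(last_1_minute=50.0, last_5_minutes=97.0))
```

Only the five-minute "completely blocked" case is RED in this sketch; every other match is YELLOW, matching the severities given in the commit message.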