@slab-msft slab-msft commented Nov 20, 2025

  • pass the input net_setup into network_init to reuse configured options
  • apply flb_input_upstream_set so the upstream inherits the input context

related stale PR (#10487)

Enable support for network setup parameters (like keepalive, timeouts) to be applied to the upstream connection in the Kubernetes events plugin. The network_init function now accepts and copies the input instance's net_setup configuration to the upstream, allowing proper TCP keepalive and timeout configuration.

Fixes the Kubernetes events plugin failing to reconnect when an API server control plane node fails. The plugin receives events over long-lived watch streams, which become stale when the underlying control plane node stops responding. The fix propagates the network configuration settings (TCP keepalive, connection recycling, timeouts) from the input plugin to the upstream connection, so that dead connections are detected and the plugin reconnects automatically to a healthy control plane node.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change

Example configuration file for the change:

[INPUT]
    Name                kubernetes_events
    Alias               in.k8s_events
    Tag                 k8s_events
    kube_url            https://kubernetes.default.svc.cluster.local:443
    interval_sec        15
    kube_request_limit  150
    DB                  /fluent-bit/db/events.db
    net.connect_timeout         10
    net.keepalive               on
    net.keepalive_idle_timeout  30
    net.tcp_keepalive           on
    net.tcp_keepalive_time      30
    net.tcp_keepalive_interval  30
    net.tcp_keepalive_probes    3
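
For readers less familiar with TCP keepalive, the sketch below shows how settings like the ones above are typically expressed as Linux socket options, and the rough time a dead peer takes to be detected. It assumes the net.tcp_keepalive_* properties map to SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; the struct and function names are hypothetical and are not Fluent Bit APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical container for the three keepalive values (not a Fluent Bit type). */
struct keepalive_opts {
    int idle;      /* seconds of idle time before the first probe (net.tcp_keepalive_time) */
    int interval;  /* seconds between probes (net.tcp_keepalive_interval) */
    int probes;    /* unanswered probes before the kernel drops the connection */
};

/* Enable keepalive on a TCP socket and apply the three Linux-specific tunables.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
static int apply_keepalive(int fd, const struct keepalive_opts *o)
{
    int on = 1;

    assert(fd >= 0 && o != NULL);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &o->idle, sizeof(o->idle)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &o->interval, sizeof(o->interval)) != 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &o->probes, sizeof(o->probes)) != 0) {
        return -1;
    }
    return 0;
}

/* Approximate seconds from the last traffic until an unresponsive peer is
 * declared dead: the idle period plus one interval per unanswered probe. */
static int dead_peer_timeout(const struct keepalive_opts *o)
{
    return o->idle + o->interval * o->probes;
}
```

With the values from the configuration above (30, 30, 3), dead_peer_timeout() gives 30 + 30 * 3 = 120 seconds, which is only an approximation of the kernel's behaviour but is in line with the roughly 30-second probe spacing visible in the connection trace further down.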

Setup with 3 control plane (CP) apiserver endpoints

IPv4 of CPs: 10.0.0.4, 10.0.0.5, 10.0.0.6

Initially, Fluent Bit connects to 10.0.0.6 (via the LB). Once that apiserver fails, crashes, or shuts down, the connection remains established but is stale.

FB logs

Fluent Bit v4.1.1
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2025/11/20 16:16:28.375325255] [ info] [fluent bit] version=4.1.1, commit=912b7d783a, pid=1
[2025/11/20 16:16:28.375416549] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/11/20 16:16:28.375423634] [ info] [simd    ] SSE2
[2025/11/20 16:16:28.375426985] [ info] [cmetrics] version=1.0.5
[2025/11/20 16:16:28.375429956] [ info] [ctraces ] version=0.6.6
[2025/11/20 16:16:28.375489181] [ info] [input:kubernetes_events:in.k8s_events] initializing
[2025/11/20 16:16:28.375496505] [ info] [input:kubernetes_events:in.k8s_events] storage_strategy='memory' (memory only)
[2025/11/20 16:16:28.413277761] [ info] [input:kubernetes_events:in.k8s_events] thread instance initialized
[2025/11/20 16:16:28.414020513] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2025/11/20 16:16:28.414030781] [ info] [sp] stream processor started
[2025/11/20 16:16:28.414089972] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/11/20 16:16:43.56832343] [ info] [input:kubernetes_events:in.k8s_events] Requesting /api/v1/events?watch=1&resourceVersion=2506189
[2025/11/20 16:20:32.773002850] [error] [/build/top/BUILD/fb/src/tls/openssl.c:977 errno=110] Connection timed out
[2025/11/20 16:20:32.773034639] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
[2025/11/20 16:20:32.773047128] [error] [http_client] broken connection to kubernetes.default.svc.cluster.local:443 ?
[2025/11/20 16:20:32.773052666] [ warn] [input:kubernetes_events:in.k8s_events] kubernetes chunked stream error.
[2025/11/20 16:20:32.773057139] [ info] [input:kubernetes_events:in.k8s_events] kubernetes stream disconnected, ret=-1
[2025/11/20 16:20:35.520307266] [ info] [input:kubernetes_events:in.k8s_events] Requesting /api/v1/events?watch=1&resourceVersion=2506958

Connection trace

root [ / ]# lsof -i -n
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
fluent-bi   1 root   63u  IPv4 5529341      0t0  TCP 10.244.0.23:58860->10.0.0.6:6443 (ESTABLISHED)

root [ / ]# tcpdump -i eth0 port 6443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
# 3x health probes to 10-0-0-6 unresponsive/unhealthy apiserver
16:19:31.332926 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [.], ack 3926392123, win 7823, options [nop,nop,TS val 3952409628 ecr 513079258], length 0
16:20:02.052930 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [.], ack 1, win 7823, options [nop,nop,TS val 3952440348 ecr 513079258], length 0
16:20:32.772929 IP fluent-bit.58860 > 10-0-0-6.kube-apiserver.6443: Flags [R.], seq 1, ack 1, win 7823, options [nop,nop,TS val 3952471068 ecr 513079258], length 0
# New connection to healthy apiserver at 10-0-0-4
16:20:35.471104 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [S], seq 3540454757, win 64770, options [mss 3810,sackOK,TS val 2693763838 ecr 0,nop,wscale 7], length 0
16:20:35.472730 IP 10-0-0-4.kube-apiserver.6443 > fluent-bit.49078: Flags [S.], seq 3369463790, ack 3540454758, win 65416, options [mss 3810,sackOK,TS val 3613053218 ecr 2693763838,nop,wscale 7], length 0
16:20:35.472756 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [.], ack 1, win 507, options [nop,nop,TS val 2693763839 ecr 3613053218], length 0
16:20:35.473155 IP fluent-bit.49078 > 10-0-0-4.kube-apiserver.6443: Flags [P.], seq 1:333, ack 1, win 507, options [nop,nop,TS val 2693763840 ecr 3613053218], length 332
16:20:35.473228 IP 10-0-0-4.kube-apiserver.6443 > fluent-bit.49078: Flags [.], ack 333, win 509, options [nop,nop,TS val 3613053219 ecr 2693763840], length 0

root [ / ]# lsof -i -n
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
fluent-bi   1 root   55u  IPv4 5529629      0t0  TCP *:2020 (LISTEN)
fluent-bi   1 root   62u  IPv4 5540210      0t0  TCP 10.244.0.23:49078->10.0.0.4:6443 (ESTABLISHED)

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Refactor
    • Improved network initialization for the Kubernetes events plugin by adding validation, error handling, and cleanup when upstream setup fails, resulting in more robust behavior and clearer error reporting in failure scenarios.


coderabbitai bot commented Nov 20, 2025

Walkthrough

Added an upstream-binding validation in the Kubernetes events plugin network init: after creating ctx->upstream, the code calls flb_input_upstream_set(ctx->upstream, ctx->ins) and, on failure, logs an error, destroys and clears the upstream, and returns -1.

Changes

Cohort / File(s) | Summary
Upstream validation & cleanup: plugins/in_kubernetes_events/kubernetes_events_conf.c | After creating ctx->upstream, call flb_input_upstream_set(ctx->upstream, ctx->ins). If it fails, log an error, call flb_upstream_destroy(ctx->upstream), set ctx->upstream = NULL, and return -1. No public API/signature changes.
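
The control flow summarized above can be sketched in isolation as follows. This is a minimal mock, not the plugin's code: every type and function prefixed with mock_ is a hypothetical stand-in for the Fluent Bit structures and for flb_input_upstream_set() / flb_upstream_destroy(), and the fail_bind flag exists only to exercise the error path.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Fluent Bit's types; not the real definitions. */
struct mock_net_setup {
    int keepalive;
    int connect_timeout;
};

struct mock_upstream {
    struct mock_net_setup net;
};

struct mock_input {
    struct mock_net_setup net_setup;
};

struct mock_ctx {
    struct mock_input *ins;
    struct mock_upstream *upstream;
};

/* Mock of the bind step: copy the input's network setup into the upstream.
 * The fail flag is a test-only switch that forces the error path. */
static int mock_input_upstream_set(struct mock_upstream *u,
                                   struct mock_input *ins, int fail)
{
    if (u == NULL || fail) {
        return -1;
    }
    u->net = ins->net_setup;
    return 0;
}

static void mock_upstream_destroy(struct mock_upstream *u)
{
    free(u);
}

/* Create the upstream, bind it to the input, and clean up if binding fails. */
static int mock_network_init(struct mock_ctx *ctx, int fail_bind)
{
    assert(ctx != NULL && ctx->ins != NULL);

    ctx->upstream = calloc(1, sizeof(*ctx->upstream));
    if (ctx->upstream == NULL) {
        return -1;
    }

    if (mock_input_upstream_set(ctx->upstream, ctx->ins, fail_bind) != 0) {
        fprintf(stderr, "network initialization failed\n");
        mock_upstream_destroy(ctx->upstream);
        ctx->upstream = NULL;   /* avoid leaving a dangling pointer behind */
        return -1;
    }
    return 0;
}
```

On success the mock upstream carries a copy of the input's network settings; on failure the caller sees -1 and a NULL ctx->upstream, which is the outcome the summary describes.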

Sequence Diagram(s)

sequenceDiagram
    participant Input as flb_input
    participant Init as network_init()
    participant Upstr as create_upstream()
    participant Bind as flb_input_upstream_set()

    Input->>Init: call network_init()
    Init->>Upstr: create upstream context
    Upstr-->>Init: upstream created
    Init->>Bind: flb_input_upstream_set(upstream, ins)
    alt bind succeeds
        Bind-->>Init: success
        Init-->>Input: return success
    else bind fails
        Bind-->>Init: error
        Init->>Upstr: flb_upstream_destroy(upstream)
        Init->>Init: ctx->upstream = NULL
        Init-->>Input: return -1 (error)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Single-file, localized control-flow addition.
  • Review focus: correct flb_input_upstream_set() return handling and ensuring proper cleanup (flb_upstream_destroy and clearing ctx->upstream).

Poem

🐇 I bound the stream with careful paw,

When binding failed, I fixed the flaw,
I tore it down and swept the ground,
Cleared the path and hopped around,
A tiny patch — a joyful bound. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main change: adding support for configuring input upstream network setup in the Kubernetes events input plugin.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f6498f and a84296b.

📒 Files selected for processing (1)
  • plugins/in_kubernetes_events/kubernetes_events_conf.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (2)
src/flb_input.c (1)
  • flb_input_upstream_set (2207-2226)
src/flb_upstream.c (1)
  • flb_upstream_destroy (656-698)
🔇 Additional comments (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (1)

131-136: LGTM! Proper error handling for upstream network setup.

The implementation correctly achieves the PR objective by calling flb_input_upstream_set() to propagate the input instance's network configuration to the upstream. The error handling is complete: it logs a clear message, properly destroys the upstream resource, nullifies the pointer to prevent use-after-free, and signals failure to the caller.

Note: This error check is defensive programming—flb_input_upstream_set() only returns -1 if the upstream is NULL (already verified at lines 126-128). While unlikely to trigger, it's good practice for resilience.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f8c50b and dbe85c1.

📒 Files selected for processing (1)
  • plugins/in_kubernetes_events/kubernetes_events_conf.c (3 hunks)
🔇 Additional comments (1)
plugins/in_kubernetes_events/kubernetes_events_conf.c (1)

131-136: LGTM! Proper network setup propagation with correct error handling.

The addition of flb_input_upstream_set correctly propagates the input instance's network configuration (net_setup) to the upstream connection, which aligns with the PR objective. The error handling properly cleans up resources by destroying the upstream and setting it to NULL before returning.

@cosmo0920
Contributor

In fluent org repos, each commit needs a Signed-off-by line.
Could you follow the guideline described here?
https://github.com/fluent/fluent-bit/pull/11188/checks?check_run_id=55962049660

…ork setup

- pass the input net_setup into network_init to reuse configured options
- apply flb_input_upstream_set so the upstream inherits the input context

Signed-off-by: slabbancz <[email protected]>
@slab-msft slab-msft force-pushed the k8s_events_input_net branch from 6f6498f to a84296b Compare November 21, 2025 08:36
@slab-msft
Author

In fluent org repos, each commit needs a Signed-off-by line. Could you follow the guideline described here? https://github.com/fluent/fluent-bit/pull/11188/checks?check_run_id=55962049660

@cosmo0920 the commit has been updated with the sign-off. Is there anything else blocking us from running the workflows or merging?
