[Possible Bug] fluent-bit engine shutdown and FB pod stays RUNNING after SIGTERM #859

containerckf · 2024-10-05T17:35:15Z

Describe the question/issue

We are seeing what appears to be a fluent-bit bug (low frequency and sporadic) in which the FB pod stops sending logs from the node correctly. In addition, the node's disk space slowly fills up as flb files are leaked onto the disk. The affected FB pod stays in the RUNNING state even after a SIGTERM is received.

The fluent-bit engine shut down after 5 seconds; however, child processes/tasks such as input:tail:tail.0 kept running and collecting flb files. The container was left running in a non-working state until manual intervention.

Fluent Bit Log Output

[engine] caught signal (SIGTERM)
[ info] [input] pausing tail.0
[ info] [input] pausing tail.1
[ info] [input] pausing tail.2
[ info] [input] pausing systemd.3
[ info] [input] pausing tail.4
[ info] [input] pausing tail.5
[ info] [input] pausing tail.6
[ info] [input] pausing tail.7
[ info] [input] pausing storage_backlog.8
[ warn] [engine] service will shutdown in max 5 seconds
[ info] [task] tail/tail.0 has 128 pending task(s):
...
[ info] [task]   task_id=0 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=1 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=2 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=3 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=4 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=5 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=6 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=7 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=8 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
[ info] [task]   task_id=9 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 
...
[ info] [engine] service has stopped (215 pending tasks)
[output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopping...
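
To make the shutdown log easier to compare across affected pods, here is a minimal Python sketch that pulls out the pending-task count reported when the engine stops and tallies the "still running" tasks per output route. The function name `summarize_shutdown` and both regular expressions are my own and are derived only from the log lines shown above; they are not part of fluent-bit and may need adjusting for other versions.

```python
import re

# Matches e.g. "[ info] [engine] service has stopped (215 pending tasks)".
# Pattern is an assumption based on the log excerpt in this issue.
STOPPED_RE = re.compile(r"\[engine\] service has stopped \((\d+) pending tasks?\)")

# Matches e.g. "task_id=3 still running on route(s): cloudwatch_logs/cloudwatch_logs.0".
TASK_RE = re.compile(r"task_id=(\d+) still running on route\(s\): (\S+)")


def summarize_shutdown(log_text):
    """Return (pending_at_stop, tasks_per_route) for a shutdown log excerpt.

    pending_at_stop is None if the "service has stopped" line is absent.
    """
    pending_at_stop = None
    tasks_per_route = {}
    for line in log_text.splitlines():
        stopped = STOPPED_RE.search(line)
        if stopped:
            pending_at_stop = int(stopped.group(1))
            continue
        task = TASK_RE.search(line)
        if task:
            route = task.group(2)
            tasks_per_route[route] = tasks_per_route.get(route, 0) + 1
    return pending_at_stop, tasks_per_route


if __name__ == "__main__":
    sample = (
        "[ info] [task]   task_id=0 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 \n"
        "[ info] [task]   task_id=1 still running on route(s): cloudwatch_logs/cloudwatch_logs.0 \n"
        "[ info] [engine] service has stopped (215 pending tasks)\n"
    )
    print(summarize_shutdown(sample))
```

Since the log above is truncated with "...", the per-route tally only covers the lines actually captured, while the pending count comes from the engine's own final line.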

The following shows the files leaked to the disk:

root@ip-:/var/fluent-bit/state/flb-storage/tail.0# while true; do echo "number of flb files" $(ls -1 | wc -l); sleep 1; done
number of flb files 5871
number of flb files 5866
number of flb files 5862
number of flb files 5859
number of flb files 5860
number of flb files 5856
number of flb files 5854
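
For longer observation windows, the one-liner above could be replaced by something like the Python sketch below, which samples the entry count of a storage directory and reports the net change. The helper names (`count_entries`, `sample_counts`, `net_change`) are hypothetical, and the directory path is simply the one from the prompt above; like `ls -1`, it skips dot-files and counts every remaining entry rather than filtering by extension.

```python
import os
import time


def count_entries(directory):
    """Count non-hidden entries, roughly what `ls -1 | wc -l` reports."""
    return sum(1 for name in os.listdir(directory) if not name.startswith("."))


def sample_counts(directory, samples, interval_s):
    """Take `samples` counts of `directory`, sleeping `interval_s` between them."""
    counts = []
    for i in range(samples):
        counts.append(count_entries(directory))
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


def net_change(counts):
    """Difference between the last and first sample (negative means draining)."""
    return counts[-1] - counts[0]


if __name__ == "__main__":
    # Path taken from the shell prompt in this issue; adjust per input plugin.
    storage_dir = "/var/fluent-bit/state/flb-storage/tail.0"
    observed = sample_counts(storage_dir, samples=7, interval_s=1)
    print("counts:", observed, "net change:", net_change(observed))
```

Applied to the seven samples printed above, `net_change` gives 5854 - 5871 = -17 over the window, so a longer sampling period is probably needed to characterize the trend.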

Fluent Bit Version Info

aws-for-fluent-bit version 2.31.12.20231011

Pod Configuration:

Name:                 aws-for-fluent-bit-xn9hn
...
Controlled By:  DaemonSet/aws-for-fluent-bit
Containers:
  aws-for-fluent-bit:
    Container ID:   containerd://fe13c77f1c340a68b76a7b749b32d5359aa85905b69f208b9941b8d49eaf6d71
    Image:          public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.12.20231011
    Image ID:       public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:70d9a689cd23bd1f37ad61e1a31853a1dc32f504926c071ffc60375f68d5ce31
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 13 Sep 2024 10:04:39 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  400Mi
    Requests:
      cpu:     500m
      memory:  100Mi
    Liveness:  http-get http://:2020/api/v1/health delay=30s timeout=10s period=10s #success=1 #failure=2
    Environment:
      AWS_REGION:                   us-east-1
      CLUSTER_NAME:              x
      HTTP_SERVER:                  
      HTTP_PORT:                    2020
      READ_FROM_HEAD:               Off
      READ_FROM_TAIL:               On
      HOST_NAME:                     (v1:spec.nodeName)
      HOSTNAME:                     aws-for-fluent-bit-xn9hn (v1:metadata.name)
      NODE_NAME:                     (v1:spec.nodeName)
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_ROLE_ARN:                 arn:aws:iam::807800687496:role/mosh-prodb-useast1-eks-fluent-bit
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /fluent-bit/etc/ from fluentbit-config (rw)
      /run/log/journal from runlogjournal (ro)
      /var/fluent-bit/state from fluentbitstate (rw)
      /var/log from varlog (ro)
      /var/log/dmesg from dmesg (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4hv8q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  fluentbit-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      aws-for-fluent-bit
    Optional:  false
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:  
  dmesg:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/dmesg
    HostPathType:  
  fluentbitstate:
    Type:          HostPath (bare host directory volume)
    Path:          /var/fluent-bit/state
    HostPathType:  
  kube-api-access-4hv8q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

Cluster Details

Version Information
Kubernetes: 1.28
Platform: eks.18

Addon Information:

kube-proxy
Configuration
Version: v1.28.2-eksbuild.2
-----------------------------------
coredns
Configuration
Version: v1.10.1-eksbuild.5
-----------------------------------
vpc-cni
Configuration
Version: v1.16.4-eksbuild.2
-----------------------------------
aws-ebs-csi-driver
Configuration
Version: v1.24.1-eksbuild.1

Application Details

Steps to reproduce issue

Have not been able to reproduce on demand; the issue is low frequency.

Related Issues

Have combed through the existing issues a few times and was not able to find a similar one.


mw-tlhakhan · 2024-10-07T12:25:37Z

Thanks @containerckf for creating this issue. I can help with further details on this issue.
