[cloudwatch_logs] SIGSEGV when PutLogEvents returns ResourceNotFoundException on an existing stream (plugin crashes instead of recreating/retrying) #11959

Description

@BassemKadri

Hello,

The cloudwatch_logs output plugin crashes with SIGSEGV when a PutLogEvents call returns ResourceNotFoundException ("The specified log stream does not exist") for a log stream that the plugin had just confirmed as existing (Log Stream ... already exists) milliseconds earlier.

Instead of recovering from this transient error (recreating the stream and retrying, as retry_limit would suggest), the whole Fluent Bit process receives a SIGSEGV and dies. Because the FireLens log-router container is essential: false, ECS did not restart it; the application kept running while all of its logs were silently dropped for 8 days before we noticed.

Version
- AWS for Fluent Bit Container Image Version: 3.2.5
- Fluent Bit: v4.2.2, commit ddfef36
- Built as a custom image on top of the above base.

Environment / setup

  • Platform: Amazon ECS with FireLens (FireLens type fluentbit, config-file-value: /extra.conf, enable-ecs-log-metadata: false)
  • Region: eu-west-1
  • Storage: memory (storage_strategy='memory')
  • Log-router container essential: false, memoryReservation: 256
  • The plugin uses dynamic stream creation via log_stream_template: $stream (≈6 distinct streams per service; the affected one,
    uam_rest_hibernate.log, is low-traffic / frequently idle).

Configuration (log-router container definition, followed by the application container's awsfirelens output options)

"log-router" = {
  essential          = false
  image              = local.fluentbit_ecr_image_url
  memory_reservation = 256
  cpu                = 128
  restart_policy = {
    enabled              = true
    restartAttemptPeriod = 60
  }
  environment = [
    { name = "DEFAULT_STREAM", value = "uam_rest_out.log" }
  ]
  firelens_configuration = {
    type = "fluentbit"
    options = {
      enable-ecs-log-metadata = "false"
      config-file-type        = "file"
      config-file-value       = "/extra.conf"
    }
  }
}

log_configuration = {
  logDriver = "awsfirelens"
  options = {
    Name                = "cloudwatch_logs"
    region              = local.region
    log_group_name      = local.environments[each.key].service.cdw_group
    log_stream_name     = "uam_rest_out.log"
    log_stream_template = "$stream"
    auto_create_group   = "true"
    log_format          = "json"
    retry_limit         = "3"
  }
}

Logs (exact crash sequence)

  [2026/06/10 14:06:30.380] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Creating log stream uam_rest_hibernate.log in log group /aws/logs/<redacted>/uam/rec1
  [2026/06/10 14:06:30.386] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Log Stream uam_rest_hibernate.log already exists
  [2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ResourceNotFoundException', message='The specified log stream does not exist.'
  [2026/06/10 14:06:30]      [engine] caught signal (SIGSEGV)
  [2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
  [2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events

Fluent Bit had started cleanly and run for ~30 hours before this crash. No stack trace/backtrace was emitted by the image — happy to re-run with backtrace enabled if you can point me to the flag.

Additional context (we verified)

  • The stream was not deleted: AWS CloudTrail shows no DeleteLogStream / DeleteLogGroup on this log group around the crash (last manual DeleteLogStream in the account was ~7 weeks earlier on an unrelated group). So ResourceNotFoundException here was a transient/inconsistent response on a stream that existed, not a real deletion.
  • Container exited with exit code 255.

Expected behavior
When PutLogEvents returns ResourceNotFoundException, the plugin should recreate the missing log stream and retry the batch (within retry_limit), or at worst log the error and continue. It must never SIGSEGV — a single API error response should not crash the whole process and take down log routing for every other stream.
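
To make the expected behavior concrete, here is a minimal Python sketch of the recreate-and-retry flow I have in mind. It is not the plugin's actual code (the plugin is written in C). FakeLogsClient, ResourceNotFound, and send_batch are hypothetical names invented for illustration, and I am assuming retry_limit means "retries after the first attempt".

```python
from dataclasses import dataclass, field


class ResourceNotFound(Exception):
    """Stand-in for a PutLogEvents ResourceNotFoundException response."""


@dataclass
class FakeLogsClient:
    """Hypothetical in-memory client: the first `transient_failures` puts fail."""
    transient_failures: int = 1
    streams: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def create_log_stream(self, stream):
        # Idempotent: "already exists" is treated as success.
        self.streams.add(stream)

    def put_log_events(self, stream, events):
        if self.transient_failures > 0:
            self.transient_failures -= 1
            raise ResourceNotFound(stream)
        self.delivered.extend(events)


def send_batch(client, stream, events, retry_limit=3):
    """Return True if the batch was delivered, False once retries are exhausted."""
    # One initial attempt plus up to `retry_limit` retries (an assumption).
    for _attempt in range(retry_limit + 1):
        try:
            client.put_log_events(stream, events)
            return True
        except ResourceNotFound:
            # Recreate the stream and try again instead of crashing.
            client.create_log_stream(stream)
    # Worst case: report failure to the caller; the process keeps running.
    return False
```

With one transient failure the batch is delivered on the second attempt. If every attempt fails, the function returns False and the other streams are unaffected.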

Impact

  • Total, silent loss of application logs for the affected service until manual restart.
  • Because the log-router is a non-essential sidecar, the crash is invisible without external monitoring.
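
For anyone else affected, a rough sketch of the kind of external check that would have caught this earlier: scan task descriptions for a running task whose log-router container has stopped. The function name find_dead_sidecars is hypothetical, and the input dict is assumed to follow the shape of an ECS DescribeTasks response (tasks[].containers[] with name, lastStatus, exitCode). Fetching that response and alerting on the result are left out.

```python
def find_dead_sidecars(describe_tasks_response, sidecar_name="log-router"):
    """Return (taskArn, exitCode) pairs for running tasks whose sidecar stopped."""
    dead = []
    for task in describe_tasks_response.get("tasks", []):
        # Only tasks that are still running can hide a dead sidecar.
        if task.get("lastStatus") != "RUNNING":
            continue
        for container in task.get("containers", []):
            if (container.get("name") == sidecar_name
                    and container.get("lastStatus") == "STOPPED"):
                dead.append((task.get("taskArn"), container.get("exitCode")))
    return dead
```

In our case this would have flagged the task as soon as the log-router exited with code 255, rather than 8 days later.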

Questions

1. Is this segfault on ResourceNotFoundException in cloudwatch_logs a known issue, and is it fixed in a release newer than AWS for Fluent Bit 3.2.5 / Fluent Bit v4.2.2? If so, which version should we upgrade to?
2. Is there any recommended mitigation for transient ResourceNotFoundException with log_stream_template + frequently idle streams (e.g., filesystem buffering, config options)?
