Hello,
The cloudwatch_logs output plugin crashes with SIGSEGV when a PutLogEvents call returns ResourceNotFoundException ("The specified log stream does not exist") for a log stream that the plugin had just confirmed as existing (Log Stream ... already exists) milliseconds earlier.
Instead of recovering from this transient error (recreating the stream and retrying, as retry_limit would suggest), the whole Fluent Bit process receives a SIGSEGV and dies. Because the FireLens log-router container is essential: false, ECS does not restart it, so the application kept running while all application logs were silently dropped for 8 days before we noticed.
Version
- AWS for Fluent Bit Container Image Version: 3.2.5
- Fluent Bit: v4.2.2, commit ddfef36
- Built as a custom image on top of the above base.
Environment / setup
- Platform: Amazon ECS with FireLens (FireLens type fluentbit, config-file-value: /extra.conf, enable-ecs-log-metadata: false)
- Region: eu-west-1
- Storage: memory (storage_strategy='memory')
- Log-router container essential: false, memoryReservation: 256
- The plugin uses dynamic stream creation via log_stream_template: $stream (≈6 distinct streams per service; the affected one, uam_rest_hibernate.log, is low-traffic / frequently idle).
Output configuration (from the application container, awsfirelens log driver)
"log-router" = {
  essential          = false
  image              = local.fluentbit_ecr_image_url
  memory_reservation = 256
  cpu                = 128
  restart_policy = {
    enabled              = true
    restartAttemptPeriod = 60
  }
  environment = [
    { name = "DEFAULT_STREAM", value = "uam_rest_out.log" }
  ]
  firelens_configuration = {
    type = "fluentbit"
    options = {
      enable-ecs-log-metadata = "false"
      config-file-type        = "file"
      config-file-value       = "/extra.conf"
    }
  }
}
log_configuration = {
  logDriver = "awsfirelens"
  options = {
    Name                = "cloudwatch_logs"
    region              = local.region
    log_group_name      = local.environments[each.key].service.cdw_group
    log_stream_name     = "uam_rest_out.log"
    log_stream_template = "$stream"
    auto_create_group   = "true"
    log_format          = "json"
    retry_limit         = "3"
  }
}
Logs (exact crash sequence)
[2026/06/10 14:06:30.380] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Creating log stream uam_rest_hibernate.log in log group /aws/logs/<redacted>/uam/rec1
[2026/06/10 14:06:30.386] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Log Stream uam_rest_hibernate.log already exists
[2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ResourceNotFoundException', message='The specified log stream does not exist.'
[2026/06/10 14:06:30] [engine] caught signal (SIGSEGV)
[2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2026/06/10 14:06:30.405] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
Fluent Bit had started cleanly and run for ~30 hours before this crash. No stack trace/backtrace was emitted by the image — happy to re-run with backtrace enabled if you can point me to the flag.
Additional context (we verified)
- The stream was not deleted: AWS CloudTrail shows no DeleteLogStream / DeleteLogGroup on this log group around the crash (last manual DeleteLogStream in the account was ~7 weeks earlier on an unrelated group). So ResourceNotFoundException here was a transient/inconsistent response on a stream that existed, not a real deletion.
- Container exited with exit code 255.
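For anyone who wants to repeat the CloudTrail check above, it could be scripted roughly as follows. This is only a sketch: it assumes a boto3-style CloudTrail client whose lookup_events call accepts LookupAttributes, StartTime, EndTime and NextToken, and the function name and the one-hour window are hypothetical choices for illustration, not necessarily how we ran the check.

```python
from datetime import timedelta


def find_delete_events(cloudtrail, crash_time, window=timedelta(hours=1)):
    """Collect DeleteLogStream / DeleteLogGroup events around crash_time.

    `cloudtrail` is assumed to behave like a boto3 CloudTrail client;
    only its lookup_events method is used. `window` is a hypothetical
    search radius on each side of the crash timestamp.
    """
    found = []
    # lookup_events takes a single lookup attribute per call,
    # so each event name is queried separately.
    for name in ("DeleteLogStream", "DeleteLogGroup"):
        kwargs = {
            "LookupAttributes": [
                {"AttributeKey": "EventName", "AttributeValue": name}
            ],
            "StartTime": crash_time - window,
            "EndTime": crash_time + window,
        }
        while True:
            page = cloudtrail.lookup_events(**kwargs)
            found.extend(page.get("Events", []))
            token = page.get("NextToken")
            if not token:
                break
            # Follow pagination until no token is returned.
            kwargs["NextToken"] = token
    return found
```

An empty result for the window around the crash is consistent with the stream never having been deleted. The returned events still need to be filtered by log group, since the lookup is account-wide for the region.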
Expected behavior
When PutLogEvents returns ResourceNotFoundException, the plugin should recreate the missing log stream and retry the batch (within retry_limit), or at worst log the error and continue. It must never SIGSEGV — a single API error response should not crash the whole process and take down log routing for every other stream.
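To make the expected behavior concrete, here is a minimal sketch of the recovery logic in Python. It is not the plugin's actual code (which is C): put_events and create_stream are hypothetical callables standing in for the PutLogEvents and CreateLogStream calls, and the sketch assumes retry_limit counts retries after the first attempt.

```python
class ResourceNotFound(Exception):
    """Stand-in for the ResourceNotFoundException returned by PutLogEvents."""


def send_with_recovery(put_events, create_stream, batch, retry_limit=3):
    """Send a batch; on ResourceNotFound, recreate the stream and retry.

    put_events and create_stream are hypothetical callables, not
    Fluent Bit APIs. Returns True if the batch was sent, False once
    the retries are exhausted. It never raises on ResourceNotFound.
    """
    # One initial attempt plus up to retry_limit retries.
    for attempt in range(retry_limit + 1):
        try:
            put_events(batch)
            return True
        except ResourceNotFound:
            if attempt == retry_limit:
                break  # out of retries: give up on this batch only
            create_stream()  # recreate the stream before the next attempt
    return False
```

The point of the sketch is the failure mode: when retries run out, the batch is reported as failed and the caller carries on with other streams, rather than the process terminating.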
Impact
- Total, silent loss of application logs for the affected service until manual restart.
- Because the log-router is a non-essential sidecar, the crash is invisible without external monitoring.
Questions
1. Is this segfault on ResourceNotFoundException in cloudwatch_logs a known issue, and is it fixed in a release newer than AWS for Fluent Bit 3.2.5 / Fluent Bit v4.2.2? If so, which version should we upgrade to?
2. Is there a recommended mitigation for transient ResourceNotFoundException with log_stream_template and frequently idle streams (e.g., filesystem buffering, config options)?