Skip to content

feat: update loki canary to support batched log push #17558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
May 20, 2025

Conversation

aarogoss
Copy link
Contributor

@aarogoss aarogoss commented May 2, 2025

What this PR does / why we need it:
This commit introduces a new EntryWriter implementation in the Loki canary codebase, supporting batched log pushes. A new runtime CLI opt, -logs-batch-size receives an int, specifying the # of log lines to batch before they are sent to Loki.

This opt may be set to '0' or '1' to disable batching entirely, in which case the canary functionality remains as-is. If an int value > 1 is passed to this CLI opt, a new BatchedPush instance is created. This instance collects log lines from its channel and stores them in local array; once the array size is at least the value of -logs-batch-size, the logs are then marshaled and pushed to Loki using the standard push API.

A second CLI opt is added, -logs-batch-size-max. This arg restricts the -logs-batch-size opt passed by the user so the # of batched log lines don't blow up the memory of the loki-canary pod. The default value is 20, ensuring the array stays w/in ~20MBs (1MB log payloads). If 20MB is too restrictive, the -logs-batch-size-max can be used to exceed this limit.

This implementation also maintains a 30s countdown timer; every 30s, the timer triggers and whatever logs exist in the "batch" array are pushed to Loki. This ensures batched logs are pushed to Loki even if the canary fails to receive more logs.

Moreover, if/when the canary org is shutdown (i.e., the termination channel receives a signal to quit), the remaining logs are likewise pushed to Loki and the shutdown process will continue.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

aarogoss added 2 commits May 2, 2025 09:55
This commit introduces a new `EntryWriter` implementation in the Loki
canary codebase, supporting batched log pushes.  A new runtime CLI opt,
`-log-batch-size` receives an int, specifying the # of log lines to
batch before they are sent to Loki.

This opt may be set to '0' or '1' to disable batching entirely, in which
case the canary functionality remains as-is.

If an int value > 1 is passed to this CLI opt, a new `BatchedPush`
instance is created.  This instance collects log lines from its channel
and stores them in local array; once the array size is at least the
value of `-log-batch-size`, the logs are then marshaled and pushed to
Loki using the standard push API.

This implementation also maintains a 30s countdown timer; every 30s, the
timer triggers and whatever logs exist in the "batch" array are pushed
to Loki.  This prevents a slowdown in log ingestion into the canary
instance from preventing logs from backing up for too long.

Moreover, if/when the canary org is shutdown (i.e., the termination
channel receives a signal to quit), the remaining logs are likewise
pushed to Loki and the shutdown process continues.
Rather than have a lockable struct with the log lines array, just
maintain a simple callback mechanism which implements locking.  This
means every time a channel in the `BatchedPush.run` function receives a
response, run a function which acquires the lock, checks the array size
(if applicable), flushes all batched log lines to Loki, and releases the
lock.

Either of these implementations should work, but this is simpler.
@pull-request-size pull-request-size bot added size/XL and removed size/L labels May 8, 2025
aarogoss added 6 commits May 8, 2025 11:41
Updates the tests in `canary/writer/push_test.go` to include batching a
logs and terminating the batched writer.  This tests our ability to stop
a writer, but ensure the batched logs are pushed before the writer is
completely stopped.
@aarogoss aarogoss marked this pull request as ready for review May 19, 2025 15:02
@aarogoss aarogoss requested a review from a team as a code owner May 19, 2025 15:02
aarogoss and others added 9 commits May 19, 2025 09:10
Adds a CLI arg w/a default value to limit the upper bound of the
loki-canary log pusher batch size.  Canary pods don't have a lot of
memory, so this implements a limit of 20 log lines in a given batch,
the goal being to restrict the memory to 20MB of logs in memory at any
given point in time.

This limit is only imposed in the canary's `main.go` module when
initializing the canary.  We don't do any additional checks in the
`batched_push.go` implementation to restrict the batch size to an
upper-bound.

This also reorganizes the push tests a little.
Fixes the existing terminate test so it accurately tests shutting down
the BufferedPush writer.  This also adds a separate test for ensuring
the timeout countdown effectively sends all logs in the batched array.

Updates the `main.go` module which checks the logBatchSize and Max so we
allow the batch size to be equal to the max size.

Also adds a few very minor changes to the `batched_push.go` module: a
few comment changes, moving some things around, removing an unnecessary
log line, and updating the `Stop()` function.
aarogoss and others added 6 commits May 19, 2025 14:14
This was causing a race condition in the test which checks for larger
batching.  The WaitGroup was potentially incremented apart from the
other goroutines, causing a potential inconsistency with how this test
waits for routines to wrap up.
…com:grafana/loki into agoss/enhance-loki-canary-ingestion-process
@aarogoss aarogoss enabled auto-merge (squash) May 20, 2025 19:51
@aarogoss aarogoss merged commit b80f186 into main May 20, 2025
65 checks passed
@aarogoss aarogoss deleted the agoss/enhance-loki-canary-ingestion-process branch May 20, 2025 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants