Conversation

@dvasilas
Collaborator

Implement the core log processing logic.

Best reviewed commit by commit.

@codecov

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 69.30320% with 163 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.78%. Comparing base (47f7d2e) to head (266c848).
⚠️ Report is 11 commits behind head on main.

Files with missing lines        Patch %   Lines
cmd/log-courier/main.go           0.00%   65 Missing ⚠️
pkg/logcourier/processor.go      82.35%   29 Missing and 13 partials ⚠️
pkg/testutil/s3.go               70.96%   28 Missing and 8 partials ⚠️
pkg/util/logging.go               0.00%   12 Missing ⚠️
pkg/logcourier/config.go         33.33%   2 Missing and 2 partials ⚠️
pkg/logcourier/offset.go         84.61%   2 Missing ⚠️
pkg/testutil/clickhouse.go       97.14%   2 Missing ⚠️

❌ Your project status has failed because the head coverage (77.78%) is below the adjusted base coverage (84.90%). You can increase the head coverage or adjust the Removed Code Behavior.

Additional details and impacted files

Files with missing lines        Coverage Δ
pkg/logcourier/configspec.go    100.00% <ø> (ø)
pkg/logcourier/logfetch.go       89.47% <100.00%> (+0.18%) ⬆️
pkg/s3/uploader.go              100.00% <ø> (ø)
pkg/util/config.go               76.92% <100.00%> (+1.24%) ⬆️
pkg/logcourier/offset.go         87.87% <84.61%> (+0.78%) ⬆️
pkg/testutil/clickhouse.go       91.51% <97.14%> (+2.20%) ⬆️
pkg/logcourier/config.go         69.23% <33.33%> (-10.77%) ⬇️
pkg/util/logging.go               0.00% <0.00%> (ø)
pkg/testutil/s3.go               70.96% <70.96%> (ø)
pkg/logcourier/processor.go      82.35% <82.35%> (ø)
... and 1 more

... and 1 file with indirect coverage changes

@@            Coverage Diff             @@
##             main      #13      +/-   ##
==========================================
- Coverage   85.31%   77.78%   -7.53%     
==========================================
  Files          14       17       +3     
  Lines         538     1049     +511     
==========================================
+ Hits          459      816     +357     
- Misses         54      186     +132     
- Partials       25       47      +22     
Flag Coverage Δ
unit 77.78% <69.30%> (-7.53%) ⬇️

Flags with carried forward coverage won't be shown.


@fredmnl fredmnl left a comment

Couple of small comments.

I was thinking about the general "cron"-ness of log-courier. We might benefit from not having it trigger at exact periodic intervals but rather on a jittered interval, so that we stop having huge spikes of work periodically (looking at you, backbeat).

Also I'm not totally sure I get how batch.MaxTimestamp gets set.
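For illustration, here is a minimal sketch of a jittered scheduling loop in Go; the function name and signature are hypothetical, not the PR's code:

```go
package main

import (
	"context"
	"math/rand"
	"time"
)

// runWithJitter is a sketch of a jittered cycle loop. Instead of firing at exact
// multiples of interval, each wait is interval ± up to 20% random jitter, so that
// several instances drift apart instead of producing periodic spikes of work.
func runWithJitter(ctx context.Context, interval time.Duration, cycle func(context.Context) error) {
	for {
		jitter := time.Duration((rand.Float64()*0.4 - 0.2) * float64(interval))
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval + jitter):
			_ = cycle(ctx) // in this sketch, the cycle is assumed to log its own errors
		}
	}
}
```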

}

// runCycle executes a single discovery and processing cycle
func (p *Processor) runCycle(ctx context.Context) error {

The flag might not even be needed since (in my current partial understanding) there is a single-threaded main loop that calls runCycle synchronously.

If we keep the flag, I'd rather have a mutex (although https://pkg.go.dev/sync#Mutex.TryLock doesn't sound very encouraging of this idea), or at least make the boolean thread-safe by using a mutex around the read/write.
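For illustration, a minimal sketch of a race-free "already running" guard using sync/atomic; the Processor type here is a stand-in, not the PR's code:

```go
package main

import "sync/atomic"

// Processor is a stand-in type for this sketch only.
type Processor struct {
	running atomic.Bool
}

// tryRunCycle runs the given function unless a cycle is already in flight.
// CompareAndSwap makes the check-and-set atomic, so no separate mutex is needed.
func (p *Processor) tryRunCycle(run func()) bool {
	if !p.running.CompareAndSwap(false, true) {
		return false // another cycle is running; skip this tick
	}
	defer p.running.Store(false)
	run()
	return true
}
```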

Collaborator Author

Indeed, I was in Node.js thinking mode.
Removed it in 51bcddf.


// uploadLogBatchWithRetry handles fetching, building, and uploading with retries
// Returns the records and offset info needed for committing
func (p *Processor) uploadLogBatchWithRetry(ctx context.Context, batch LogBatch) ([]LogRecord, time.Time, uint16, error) {

Perhaps you can create a retryer decorator; I'm not sure whether Go generics can help in making a very generic one.


Also, [nit] jitter is a nice touch to add to exponential backoff.

Collaborator Author

Extracted the retry logic to a decorator (309c974) and added jitter (b2cdee8). Thanks for the suggestions.
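For illustration, a minimal sketch of a generic retry decorator with exponential backoff and jitter; the names and signature are hypothetical and not necessarily the shape used in 309c974 / b2cdee8:

```go
package main

import (
	"context"
	"math/rand"
	"time"
)

// withRetry retries op with exponential backoff plus up to 50% jitter, returning the
// first successful result or the last error once maxRetries is exhausted.
func withRetry[T any](ctx context.Context, maxRetries int, baseDelay time.Duration,
	op func(context.Context) (T, error)) (T, error) {

	var zero T
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			backoff := baseDelay * time.Duration(1<<uint(attempt-1))
			backoff += time.Duration(rand.Int63n(int64(backoff)/2 + 1)) // jitter
			select {
			case <-ctx.Done():
				return zero, ctx.Err()
			case <-time.After(backoff):
			}
		}
		v, err := op(ctx)
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return zero, lastErr
}
```

A call site could then wrap the fetch/build/upload step, e.g. withRetry(ctx, p.maxRetries, time.Second, func(ctx context.Context) ([]LogRecord, error) { ... }).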

@leif-scality leif-scality left a comment

It seems that if we have multiple log-couriers they are going to build the same log S3 object multiple times; S3 is going to reject the upload, but we are still going to waste resources building it for nothing. We should have a task queue to prevent duplicate work.

"consecutiveFailures", p.consecutiveFailures,
"maxConsecutiveCycleFailures", maxConsecutiveCycleFailures)

if p.consecutiveFailures >= maxConsecutiveCycleFailures {


If we return here we are going to crash log-courier, do we want this?

Collaborator Author

Yes, my intention here is to stop log-courier.
Here is the scenario I have in mind: ClickHouse storage grows, and the circuit-breaker makes the database read-only.
So log-courier can read logs and upload S3 objects, but cannot commit.
In that case, log-courier will keep uploading the same objects indefinitely.
So the idea here is to exit if we are not able to make any commit after 3 attempts (including retries).
However, what will happen in practice is that a higher-level mechanism (supervisord / ballot) will restart log-courier.

It may be better to simplify and continue retrying instead of exiting in that case.

// 3. No successes, any transient errors -> cycle fails
// Indicates system-wide issue (ClickHouse down, S3 throttled, etc.)
// After 3 consecutive cycle failures, processor exits.
if successCount == 0 && transientErrorCount > 0 {


Weird condition: we should retry in a loop if we have an error, so why do we want to discard the error if one batch is OK? If we have 100 batches with 1 success and 99 errors we don't return an error; do we want this?

Collaborator Author

If we return an error we will trigger the consecutiveFailures mechanism (and possibly crash log-courier), so we want to return an error only when we are not able to make any progress.

The idea is that if we were able to write logs and commit for even 1 bucket, then we should log the errors, but continue to the next processing cycle.

The other case is that we fail for all buckets, but the failure is "transient", for example the account we use for writing does not exist or we have passed an incorrect access/secret key. I don't think log-courier should exit for this type of error.
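For illustration, a minimal sketch of the outcome rule described above (hypothetical names; the real processor.go may structure this differently):

```go
package main

import "errors"

// errNoProgress is a sketch-level sentinel for "no bucket made progress this cycle".
var errNoProgress = errors.New("no batch succeeded and transient errors occurred")

// cycleOutcome returns a non-nil error only when no batch succeeded and at least one
// failure was transient, i.e. the cycle made no progress for what looks like a
// system-wide reason. Any success, or permanent ("usage") errors alone, lets the
// processor move on to the next cycle.
func cycleOutcome(successCount, transientErrorCount int) error {
	if successCount == 0 && transientErrorCount > 0 {
		return errNoProgress
	}
	return nil
}
```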

@jonathan-gramain jonathan-gramain left a comment

This PR is quite big 🤯 🙂

}

// Delete objects
for _, obj := range listOutput.Contents {


Instead of a loop, what about sending a single DeleteObjects request for efficiency? I know this is just for tests, but it can accelerate their execution slightly.

Collaborator Author

Good point, done.
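For illustration, a minimal sketch of deleting the listed objects with a single DeleteObjects call using aws-sdk-go-v2; the helper name is illustrative, not necessarily the test helper's API:

```go
package main

import (
	"context"

	awss3 "github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// deleteAll removes every object returned by a prior ListObjectsV2 call in one request.
// DeleteObjects accepts up to 1000 keys per call; pagination is omitted in this sketch.
func deleteAll(ctx context.Context, client *awss3.Client, bucket *string, contents []types.Object) error {
	if len(contents) == 0 {
		return nil
	}
	ids := make([]types.ObjectIdentifier, 0, len(contents))
	for _, obj := range contents {
		ids = append(ids, types.ObjectIdentifier{Key: obj.Key})
	}
	_, err := client.DeleteObjects(ctx, &awss3.DeleteObjectsInput{
		Bucket: bucket,
		Delete: &types.Delete{Objects: ids},
	})
	return err
}
```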

Comment on lines 170 to 177
p.consecutiveFailures++
p.logger.Error("cycle failed",
"error", err,
"consecutiveFailures", p.consecutiveFailures,
"maxConsecutiveCycleFailures", maxConsecutiveCycleFailures)
} else {
p.consecutiveFailures = 0
}


What about embedding this logic inside runCycle, so it avoids duplicating it in the Run function? runCycle would then only send back an error when it has exhausted its failure limit.

Collaborator Author

I opted for removing this logic entirely and instead let runCycle retry indefinitely.


for attempt := 0; attempt <= p.maxRetries; attempt++ {
if attempt > 0 {
p.logger.Info("retrying upload after backoff",


I'd put this log line after the select, to log it just before we actually do retry.

Collaborator Author

I kept this because in the updated implementation it prints the backoff duration, and I added a Debug log just before we retry the operation.

lastErr = err

if IsPermanentError(err) {
p.logger.Error("permanent error, not retrying",


Are "permanent" errors considered normal in some cases? I understood this from another comment, in which case I suggest using Info level for logging those benign errors (if we can distinguish them easily) to avoid alarming the admin.

Collaborator Author

Yes, in the sense that they are "usage" errors (a target bucket does not have the policy granting write access to the log delivery user).
Changed to Info.


BeforeEach(func() {
// Configure viper for all config keys
viper.Reset()


You could do logcourier.ConfigSpec.Reset() instead for better encapsulation.

Maybe you can also use a helper to reset the rest to their default values (although I would think viper does this automatically when Reset is called, but maybe not 🤷).

Collaborator Author

Replaced with logcourier.ConfigSpec.Reset()

Expect(objects).NotTo(BeEmpty(), "Expected at least one log object in S3")

// Verify object content
if len(objects) > 0 {


It may not be necessary to check if the objects array has elements, thanks to the above Expect check.

Collaborator Author

Thanks! done.

Collaborator Author
@dvasilas dvasilas left a comment

Thank you for your patience in reviewing this huge PR !
I should have split it into smaller ones 🙈

Comment on lines 377 to 378
logFields["attempt"] = attempt
logFields["backoffSeconds"] = backoff.Seconds()


As a best practice I would avoid mutating the parameter logFields and create a local copy instead (say, if it were passed from a global static value one day, mutating it would be problematic).

You could also consider making the logFields parameter optional (e.g. support nil) and creating an empty map in that case.

Collaborator Author

Good point, thanks!
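For illustration, a minimal sketch of copying the fields map instead of mutating the caller's, with nil treated as empty; the helper name is hypothetical:

```go
package main

import "time"

// withRetryFields returns a copy of logFields extended with per-attempt retry metadata,
// leaving the caller's map untouched. A nil logFields is treated as an empty map.
func withRetryFields(logFields map[string]any, attempt int, backoff time.Duration) map[string]any {
	fields := make(map[string]any, len(logFields)+2)
	for k, v := range logFields {
		fields[k] = v
	}
	fields["attempt"] = attempt
	fields["backoffSeconds"] = backoff.Seconds()
	return fields
}
```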

for _, obj := range listOutput.Contents {
_, delErr := h.client.DeleteObject(ctx, &awss3.DeleteObjectInput{
// Delete objects in batch
if len(listOutput.Contents) > 0 {


It should be fine for now, but you could consider doing ListObjectVersions and pass the version ID to delete to be able to support versioned buckets as well (it would still work the same on nonversioned buckets).

Collaborator Author

I added testing with versioned buckets to https://scality.atlassian.net/browse/LOGC-16 to make sure we don't forget.
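For illustration, a minimal sketch of version-aware cleanup with aws-sdk-go-v2, as suggested above; the helper is illustrative and pagination is omitted:

```go
package main

import (
	"context"

	awss3 "github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// deleteAllVersions removes every object version and delete marker in the bucket.
// It behaves the same as a plain delete on non-versioned buckets.
func deleteAllVersions(ctx context.Context, client *awss3.Client, bucket *string) error {
	out, err := client.ListObjectVersions(ctx, &awss3.ListObjectVersionsInput{Bucket: bucket})
	if err != nil {
		return err
	}
	var ids []types.ObjectIdentifier
	for _, v := range out.Versions {
		ids = append(ids, types.ObjectIdentifier{Key: v.Key, VersionId: v.VersionId})
	}
	for _, m := range out.DeleteMarkers {
		ids = append(ids, types.ObjectIdentifier{Key: m.Key, VersionId: m.VersionId})
	}
	if len(ids) == 0 {
		return nil
	}
	_, err = client.DeleteObjects(ctx, &awss3.DeleteObjectsInput{
		Bucket: bucket,
		Delete: &types.Delete{Objects: ids},
	})
	return err
}
```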

When no offset exists, max() returns NULL, which the ClickHouse driver
converts to the Unix epoch (1970-01-01). Use maxOrNull() and sql.NullTime
to properly detect NULL values, and return the Go zero time for missing
offsets, enabling reliable IsZero() checks. This makes the case in which a
bucket does not yet have a committed offset more explicit (a sketch of this
approach follows these commit notes).

Add OffsetManagerInterface and UploaderInterface to enable tests to use
custom implementations (for injecting errors). Existing implementations
already satisfy these interfaces.

- S3TestHelper for bucket/object operations
- CountingUploader for tracking upload attempts
- InsertTestLogWithTargetBucket for inserting test data
- FailingOffsetManager for simulating offset commit failures

- Discovery cycle with configurable interval
- Parallel batch processing with worker pool
- Retry logic
- Error classification (permanent vs transient)

Transform the stub main into a complete application:
- Load and validate configuration
- Set up logging
- Create and initialize processor
- Signal handling
- Shutdown timeout with cleanup
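For illustration, a minimal sketch of the NULL-aware offset read described in the first note; the table and column names are hypothetical, and the real offset.go may differ:

```go
package main

import (
	"context"
	"database/sql"
	"time"
)

// latestOffset returns the most recent committed offset for a bucket, or the zero
// time.Time when no offset has been committed yet, so callers can rely on IsZero().
func latestOffset(ctx context.Context, db *sql.DB, bucket string) (time.Time, error) {
	var ts sql.NullTime
	// maxOrNull() yields NULL (instead of the epoch) when no rows match.
	err := db.QueryRowContext(ctx,
		"SELECT maxOrNull(committed_at) FROM log_offsets WHERE bucket = ?", bucket,
	).Scan(&ts)
	if err != nil {
		return time.Time{}, err
	}
	if !ts.Valid {
		return time.Time{}, nil // no committed offset yet for this bucket
	}
	return ts.Time, nil
}
```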
@dvasilas dvasilas merged commit 97a76e9 into main Nov 20, 2025
2 of 3 checks passed
@dvasilas dvasilas deleted the improvement/LOGC-7 branch November 20, 2025 10:28