
Conversation

@ikhoon
Contributor

@ikhoon ikhoon commented Oct 1, 2025

Motivation:

Currently, there are no metrics to identify which part of a `Command` execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric.

  • Command execution time
  • Log replay time
  • Log store time

Modifications:

  • Add `ReplicationTimings` to record the timings.
  • Add `ReplicationMetrics` to export the recorded metrics per project through Micrometer.

Result:

You can now use the following four metrics to measure the write performance of Central Dogma (a read-back sketch follows the list).

  • `replication.lock.waiting`
  • `replication.command.execution`
  • `replication.log.replay`
  • `replication.log.store`
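
For illustration only, here is a minimal sketch of reading such timers back from a Micrometer MeterRegistry. The `project` tag key is an assumption based on the per-project export described above, not necessarily the PR's exact tag name.

```java
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ReplicationMetricsReadback {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Simulate one recorded command execution for a project.
        // (The "project" tag key is an assumption for illustration.)
        Timer timer = registry.timer("replication.command.execution", "project", "foo");
        timer.record(42, TimeUnit.MILLISECONDS);

        // Read back the aggregate statistics the timer keeps.
        System.out.printf("count=%d mean=%.1fms max=%.1fms%n",
                          timer.count(),
                          timer.mean(TimeUnit.MILLISECONDS),
                          timer.max(TimeUnit.MILLISECONDS));
    }
}
```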

Summary by CodeRabbit

  • New Features

    • Added detailed replication timing metrics, enabling per-project insights into lock acquisition, log replay, command execution, and log storage.
    • Execution logs now include concise timing summaries for each replication phase.
  • Refactor

    • Unified and streamlined timing instrumentation across the replication pipeline for consistent observability and lower overhead.

@ikhoon ikhoon added this to the 0.78.0 milestone Oct 1, 2025
@coderabbitai

coderabbitai bot commented Oct 1, 2025

Walkthrough

Adds new internal classes ReplicationMetrics and ReplicationTimings to capture per-project replication timings. Refactors ZooKeeperCommandExecutor to instrument lock acquisition, log replay, command execution, and log store phases, creating and recording timings per command without changing public APIs.

Changes

  • Replication metrics collection
    Files: server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java, server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
    Introduces a package-private metrics holder (Micrometer Timers) and a timings recorder that brackets phases (lock, replay, execute, store), records durations, and formats a timings string. (A hypothetical holder is sketched after this list.)
  • Executor instrumentation
    Files: server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java
    Replaces the prior lock timing map with ReplicationMetrics/ReplicationTimings. Wraps the blocking execution path with start/end timing calls, logs timings, and records metrics. Adjusts imports and internal helpers; no public API changes.
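
To make the per-project export concrete, here is a hypothetical sketch of a lazily created per-project timer holder in the spirit of the ReplicationMetrics class described above. The class name, tag key, and caching strategy are assumptions, not the PR's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical per-project holder: one Timer per project, created lazily.
final class PerProjectReplicationTimers {

    private final MeterRegistry registry;
    private final Map<String, Timer> commandExecutionTimers = new ConcurrentHashMap<>();

    PerProjectReplicationTimers(MeterRegistry registry) {
        this.registry = registry;
    }

    // Returns the command-execution timer for the given project,
    // registering it on first use. (Tag key "project" is an assumption.)
    Timer commandExecutionTimer(String project) {
        return commandExecutionTimers.computeIfAbsent(
                project,
                p -> registry.timer("replication.command.execution", "project", p));
    }
}
```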

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant C as Caller
  participant E as ZooKeeperCommandExecutor
  participant T as ReplicationTimings
  participant ZK as ZooKeeper
  participant M as Metrics (Timers)
  participant LS as LogStore

  C->>E: execute(command)
  E->>E: newReplicationTimings(command)
  E->>T: create with ReplicationMetrics (per project)
  activate T

  E->>T: startLockAcquisition()
  E->>ZK: acquireLock()
  alt lock acquired
    ZK-->>E: success
    E->>T: endLockAcquisition(acquired=true)
  else lock failed
    ZK-->>E: failure
    E->>T: endLockAcquisition(acquired=false)
  end

  E->>T: startLogReplay()
  E->>E: replayLogs()
  E->>T: endLogReplay()

  E->>T: startCommandExecution()
  E->>E: apply/normalize command
  E->>T: endCommandExecution()

  E->>T: startLogStore()
  E->>LS: append ReplicationLog
  LS-->>E: stored
  E->>T: endLogStore()

  E-->>C: future/result
  E->>T: record() to M
  T-->>M: Timer.record(durations)
  E->>E: log timingsString()
  deactivate T
```
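
The start/end bracketing shown in the diagram could look roughly like the following sketch. The field names mirror those visible in the review diff below (lockAcquisitionStartNanos, lockAcquisitionDurationNanos), but the class itself is illustrative and the actual ReplicationTimings API may differ.

```java
// Illustrative recorder following the start/end pattern from the diagram.
final class PhaseTimings {

    private long lockAcquisitionStartNanos;
    private long lockAcquisitionDurationNanos;
    private boolean lockAcquired;

    void startLockAcquisition() {
        lockAcquisitionStartNanos = System.nanoTime();
    }

    void endLockAcquisition(boolean acquired) {
        lockAcquisitionDurationNanos = System.nanoTime() - lockAcquisitionStartNanos;
        lockAcquired = acquired;
    }

    // startLogReplay()/endLogReplay(), startCommandExecution()/
    // endCommandExecution() and startLogStore()/endLogStore() would
    // follow the same pattern with their own fields.

    @Override
    public String toString() {
        return String.format("lock: %d ms (acquired=%b)",
                             lockAcquisitionDurationNanos / 1_000_000, lockAcquired);
    }
}
```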

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • trustin
  • jrhee17

Poem

I hop through locks with ticking feet,
Replay the logs, a steady beat.
Commands now sprint, then gently slow,
Timers nibble nanosec snow.
Store the tale, record the time—
Replication’s rhythm, rabbit rhyme. 🐇⏱️

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 3.85%, which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title “Record ZooKeeperCommandExecutor timings” succinctly conveys the primary change of adding timing metrics to the ZooKeeperCommandExecutor path, focusing on the core functionality without extraneous details. It is specific, clear, and directly related to the main modifications in the PR, enabling reviewers to quickly understand the purpose.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd34da0 and 6a58a9f.

📒 Files selected for processing (3)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
  • server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
  • RepositoryCommand (35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-ubuntu-latest-jdk-17-coverage
  • GitHub Check: build-ubuntu-latest-jdk-11
  • GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
  • GitHub Check: build-macos-latest-jdk-21
  • GitHub Check: build-windows-latest-jdk-21
  • GitHub Check: build-ubuntu-latest-jdk-21-snapshot
  • GitHub Check: flaky-tests
  • GitHub Check: lint

Comment on lines +77 to +90

```java
    void record() {
        if (metrics == null) {
            return;
        }

        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
```

⚠️ Potential issue | 🟠 Major

Avoid recording metrics for phases that never started.

If we fail before startCommandExecution()/startLogReplay()/startLogStore() runs (e.g., lock acquisition timeout), these timers still record(0, …), bumping the sample count and hiding that the phase never happened. Guard each record with *_StartNanos != 0 so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.

Apply this diff:

```diff
@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

```java
    void record() {
        if (metrics == null) {
            return;
        }
        if (lockAcquisitionStartNanos != 0) {
            if (lockAcquired) {
                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
            } else {
                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
            }
        }
        if (commandExecutionStartNanos != 0) {
            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
        }
        if (logReplayStartNanos != 0) {
            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
        }
        if (logStoreStartNanos != 0) {
            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
```
🤖 Prompt for AI Agents

```
In server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
around lines 77 to 90, the record() method currently records zero-duration
metrics for phases that never started; update the method to only call each
phase timer.record(...) if that phase's start timestamp field (e.g.,
commandExecutionStartNanos, logReplayStartNanos, logStoreStartNanos) is
non-zero so that only completed phases contribute measurements; keep the
existing lock timer behavior (it should still record based on
lockAcquired/lockAcquisitionDurationNanos) and add simple guards like
if (commandExecutionStartNanos != 0) { ... } for each of the three phase
timers.
```

Contributor

@minwoox minwoox left a comment


👍 👍 👍

I'm not an expert on ZooKeeper, but how about also collecting the lock release time, since we collect all kinds of replication metrics?
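
A hypothetical sketch of what timing the lock release could look like; the timeLockRelease helper, the "replication.lock.release" timer name, and the "project" tag are illustrative assumptions, not code from this PR.

```java
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;

final class LockReleaseTiming {

    // Hypothetical helper: times a lock-release action and records the
    // duration to a "replication.lock.release" timer (name assumed).
    static void timeLockRelease(MeterRegistry registry, String project, Runnable releaseLock) {
        final long startNanos = System.nanoTime();
        try {
            releaseLock.run();
        } finally {
            registry.timer("replication.lock.release", "project", project)
                    .record(System.nanoTime() - startNanos, TimeUnit.NANOSECONDS);
        }
    }
}
```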

```java
        } catch (Throwable t) {
            future.completeExceptionally(t);
        } finally {
            timings.record();
```
Contributor


Understood that if a failure occurs while replaying, certain metrics (e.g. commandExecutionTimer) will record 0 instead of not recording at all.
