Record ZooKeeperCommandExecutor timings #1191
Conversation
Motivation:

Currently, there are no metrics to identify which part of a `Command` execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric.

- Command execution time
- Log replay time
- Log store time

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project through Micrometer.

Result:

You can now use the following four metrics to measure the write performance of Central Dogma (see the sketch after this list).

- `replication.lock.waiting`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`
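For illustration, a minimal sketch of how such per-project timers might be registered through Micrometer. The class shape and the `project` tag name are assumptions made for the sketch, not this PR's actual code; only the metric names come from the list above.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical per-project metrics holder; metric names follow the PR description.
final class ReplicationMetricsSketch {
    final Timer lockWaiting;
    final Timer commandExecution;
    final Timer logReplay;
    final Timer logStore;

    ReplicationMetricsSketch(MeterRegistry registry, String project) {
        lockWaiting = Timer.builder("replication.lock.waiting")
                           .tag("project", project).register(registry);
        commandExecution = Timer.builder("replication.command.execution")
                                .tag("project", project).register(registry);
        logReplay = Timer.builder("replication.log.replay")
                         .tag("project", project).register(registry);
        logStore = Timer.builder("replication.log.store")
                        .tag("project", project).register(registry);
    }
}
```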
Walkthrough

Adds new internal classes `ReplicationMetrics` and `ReplicationTimings` to capture per-project replication timings. Refactors `ZooKeeperCommandExecutor` to instrument lock acquisition, log replay, command execution, and log store phases, creating and recording timings per command without changing public APIs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Caller
participant E as ZooKeeperCommandExecutor
participant T as ReplicationTimings
participant ZK as ZooKeeper
participant M as Metrics (Timers)
participant LS as LogStore
C->>E: execute(command)
E->>E: newReplicationTimings(command)
E->>T: create with ReplicationMetrics (per project)
activate T
E->>T: startLockAcquisition()
E->>ZK: acquireLock()
alt lock acquired
ZK-->>E: success
E->>T: endLockAcquisition(acquired=true)
else lock failed
ZK-->>E: failure
E->>T: endLockAcquisition(acquired=false)
end
E->>T: startLogReplay()
E->>E: replayLogs()
E->>T: endLogReplay()
E->>T: startCommandExecution()
E->>E: apply/normalize command
E->>T: endCommandExecution()
E->>T: startLogStore()
E->>LS: append ReplicationLog
LS-->>E: stored
E->>T: endLogStore()
E-->>C: future/result
E->>T: record() to M
T-->>M: Timer.record(durations)
E->>E: log timingsString()
    deactivate T
```
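Reading the diagram as code, the instrumented flow looks roughly like the sketch below. The `ReplicationTimings` method names come from the diagram itself; `acquireLock()`, `replayLogs()`, `applyCommand()` and `storeLog()` are placeholders for the executor's real steps, not actual method names.

```java
// Illustrative only: the shape of the instrumented execute() path.
<T> T executeInstrumented(Command<T> command) throws Exception {
    final ReplicationTimings timings = newReplicationTimings(command);
    try {
        timings.startLockAcquisition();
        final boolean acquired = acquireLock(command);   // placeholder
        timings.endLockAcquisition(acquired);

        timings.startLogReplay();
        replayLogs();                                    // placeholder
        timings.endLogReplay();

        timings.startCommandExecution();
        final T result = applyCommand(command);          // placeholder
        timings.endCommandExecution();

        timings.startLogStore();
        storeLog(command, result);                       // placeholder
        timings.endLogStore();
        return result;
    } finally {
        timings.record();                                // feed the Micrometer timers
        logger.debug("{}", timings.timingsString());
    }
}
```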
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
RepositoryCommand(35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: flaky-tests
- GitHub Check: lint
```java
void record() {
    if (metrics == null) {
        return;
    }

    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}
```
Avoid recording metrics for phases that never started.
If we fail before startCommandExecution()/startLogReplay()/startLogStore() runs (e.g., lock acquisition timeout), these timers still record(0, …), bumping the sample count and hiding that the phase never happened. Guard each record with *_StartNanos != 0 so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.
Apply this diff:

```diff
@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquisitionStartNanos != 0) {
        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
    if (commandExecutionStartNanos != 0) {
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logReplayStartNanos != 0) {
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logStoreStartNanos != 0) {
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
}
```
🤖 Prompt for AI Agents

```
In server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
around lines 77 to 90, the record() method currently records zero-duration
metrics for phases that never started; update the method to only call each phase
timer.record(...) if that phase's start timestamp field (e.g.,
commandExecutionStartNanos, logReplayStartNanos, logStoreStartNanos) is non-zero
so that only completed phases contribute measurements; keep the existing lock
timer behavior (it should still record based on
lockAcquired/lockAcquisitionDurationNanos) and add simple guards like if
(commandExecutionStartNanos != 0) { ... } for each of the three phase timers.
```
👍 👍 👍
I'm not a ZooKeeper expert, but how about also collecting the lock release time, since we collect all kinds of replication metrics?
`ZooKeeperCommandExecutor.java`, line 973 in 6a58a9f:

```java
return () -> safeRelease(mtx);
```
```java
} catch (Throwable t) {
    future.completeExceptionally(t);
} finally {
    timings.record();
```
Understood. If a failure occurs while replaying, certain metrics (e.g. `commandExecutionTimer`) will record 0 instead of not recording at all; the sketch below shows why that skews the timer's statistics.
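For illustration, a minimal, self-contained sketch (using Micrometer's `SimpleMeterRegistry`, not this PR's code) of how a zero-duration sample inflates the count and drags down the mean:

```java
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ZeroSampleDemo {
    public static void main(String[] args) {
        final SimpleMeterRegistry registry = new SimpleMeterRegistry();
        final Timer timer = Timer.builder("replication.command.execution").register(registry);

        timer.record(40, TimeUnit.MILLISECONDS); // a phase that actually ran
        timer.record(0, TimeUnit.NANOSECONDS);   // a phase that never started

        System.out.println(timer.count());                     // 2, not 1
        System.out.println(timer.mean(TimeUnit.MILLISECONDS)); // 20.0, not 40.0
    }
}
```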
Summary by CodeRabbit
New Features
Refactor