Record ZooKeeperCommandExecutor timings #1191
Conversation
Motivation:

Currently, there are no metrics to identify which part of a `Command` execution is slow. To improve the write performance of Central Dogma, I propose adding three new metrics in addition to the existing lock acquisition metric.

- Command execution time
- Log replay time
- Log store time

Modifications:

- Add `ReplicationTimings` to record the timings.
- Add `ReplicationMetrics` to export the recorded metrics per project through Micrometer.

Result:

You can now use the following four metrics to measure the write performance of Central Dogma (see the sketch after this list).

- `replication.lock.waiting`
- `replication.command.execution`
- `replication.log.replay`
- `replication.log.store`
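For illustration, a minimal sketch of how such per-project timers might be registered through Micrometer. The class shape and the `project` tag name are assumptions made for the sketch, not this PR's actual code; only the metric names come from the list above.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical per-project metrics holder; metric names follow the PR description.
final class ReplicationMetricsSketch {
    final Timer lockWaiting;
    final Timer commandExecution;
    final Timer logReplay;
    final Timer logStore;

    ReplicationMetricsSketch(MeterRegistry registry, String project) {
        lockWaiting = Timer.builder("replication.lock.waiting")
                           .tag("project", project).register(registry);
        commandExecution = Timer.builder("replication.command.execution")
                                .tag("project", project).register(registry);
        logReplay = Timer.builder("replication.log.replay")
                         .tag("project", project).register(registry);
        logStore = Timer.builder("replication.log.store")
                        .tag("project", project).register(registry);
    }
}
```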
Walkthrough

Adds new internal classes `ReplicationMetrics` and `ReplicationTimings` to capture per-project replication timings. Refactors `ZooKeeperCommandExecutor` to instrument lock acquisition, log replay, command execution, and log store phases, creating and recording timings per command without changing public APIs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Caller
participant E as ZooKeeperCommandExecutor
participant T as ReplicationTimings
participant ZK as ZooKeeper
participant M as Metrics (Timers)
participant LS as LogStore
C->>E: execute(command)
E->>E: newReplicationTimings(command)
E->>T: create with ReplicationMetrics (per project)
activate T
E->>T: startLockAcquisition()
E->>ZK: acquireLock()
alt lock acquired
ZK-->>E: success
E->>T: endLockAcquisition(acquired=true)
else lock failed
ZK-->>E: failure
E->>T: endLockAcquisition(acquired=false)
end
E->>T: startLogReplay()
E->>E: replayLogs()
E->>T: endLogReplay()
E->>T: startCommandExecution()
E->>E: apply/normalize command
E->>T: endCommandExecution()
E->>T: startLogStore()
E->>LS: append ReplicationLog
LS-->>E: stored
E->>T: endLogStore()
E-->>C: future/result
E->>T: record() to M
T-->>M: Timer.record(durations)
E->>E: log timingsString()
    deactivate T
```
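Reading the diagram as code, the instrumented flow looks roughly like the sketch below. The `ReplicationTimings` method names come from the diagram itself; `acquireLock()`, `replayLogs()`, `applyCommand()` and `storeLog()` are placeholders for the executor's real steps, not actual method names.

```java
// Illustrative only: the shape of the instrumented execute() path.
<T> T executeInstrumented(Command<T> command) throws Exception {
    final ReplicationTimings timings = newReplicationTimings(command);
    try {
        timings.startLockAcquisition();
        final boolean acquired = acquireLock(command);   // placeholder
        timings.endLockAcquisition(acquired);

        timings.startLogReplay();
        replayLogs();                                    // placeholder
        timings.endLogReplay();

        timings.startCommandExecution();
        final T result = applyCommand(command);          // placeholder
        timings.endCommandExecution();

        timings.startLogStore();
        storeLog(command, result);                       // placeholder
        timings.endLogStore();
        return result;
    } finally {
        timings.record();                                // feed the Micrometer timers
        logger.debug("{}", timings.timingsString());
    }
}
```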
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationMetrics.java (1 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java (1 hunks)
- server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
server/src/main/java/com/linecorp/centraldogma/server/command/RepositoryCommand.java (1)
RepositoryCommand(35-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: build-ubuntu-latest-jdk-17-coverage
- GitHub Check: build-ubuntu-latest-jdk-11
- GitHub Check: build-ubuntu-latest-jdk-17-min-java-11
- GitHub Check: build-macos-latest-jdk-21
- GitHub Check: build-windows-latest-jdk-21
- GitHub Check: build-ubuntu-latest-jdk-21-snapshot
- GitHub Check: flaky-tests
- GitHub Check: lint
```java
void record() {
    if (metrics == null) {
        return;
    }

    if (lockAcquired) {
        metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    } else {
        metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
    }
    metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
}
```
Avoid recording metrics for phases that never started.
If we fail before startCommandExecution()/startLogReplay()/startLogStore() runs (e.g., lock acquisition timeout), these timers still record(0, …), bumping the sample count and hiding that the phase never happened. Guard each record with *_StartNanos != 0 so only completed phases contribute measurements, leaving lock failures to be represented solely by the lock timer.
Apply this diff:

```diff
@@
-        if (lockAcquired) {
-            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        } else {
-            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
-        }
-        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
-        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        if (lockAcquisitionStartNanos != 0) {
+            if (lockAcquired) {
+                metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            } else {
+                metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
+            }
+        }
+        if (commandExecutionStartNanos != 0) {
+            metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logReplayStartNanos != 0) {
+            metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
+        }
+        if (logStoreStartNanos != 0) {
+            metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
+        }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```java
void record() {
    if (metrics == null) {
        return;
    }
    if (lockAcquisitionStartNanos != 0) {
        if (lockAcquired) {
            metrics.lockAcquireSuccessTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        } else {
            metrics.lockAcquireFailureTimer().record(lockAcquisitionDurationNanos, TimeUnit.NANOSECONDS);
        }
    }
    if (commandExecutionStartNanos != 0) {
        metrics.commandExecutionTimer().record(commandExecutionDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logReplayStartNanos != 0) {
        metrics.logReplayTimer().record(logReplayDurationNanos, TimeUnit.NANOSECONDS);
    }
    if (logStoreStartNanos != 0) {
        metrics.logStoreTimer().record(logStoreDurationNanos, TimeUnit.NANOSECONDS);
    }
}
```
🤖 Prompt for AI Agents

```
In server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationTimings.java
around lines 77 to 90, the record() method currently records zero-duration
metrics for phases that never started; update the method to only call each phase
timer.record(...) if that phase's start timestamp field (e.g.,
commandExecutionStartNanos, logReplayStartNanos, logStoreStartNanos) is non-zero
so that only completed phases contribute measurements; keep the existing lock
timer behavior (it should still record based on
lockAcquired/lockAcquisitionDurationNanos) and add simple guards like if
(commandExecutionStartNanos != 0) { ... } for each of the three phase timers.
```
👍 👍 👍
I'm not a ZooKeeper expert, but how about also collecting the lock release time, since we collect all kinds of replication metrics?
`ZooKeeperCommandExecutor.java`, line 973 in 6a58a9f:

```java
return () -> safeRelease(mtx);
```
```java
} catch (Throwable t) {
    future.completeExceptionally(t);
} finally {
    timings.record();
```
Understood. If a failure occurs while replaying, certain metrics (e.g. `commandExecutionTimer`) will record 0 instead of not recording at all; the sketch below shows why that skews the timer's statistics.
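For illustration, a minimal, self-contained sketch (using Micrometer's `SimpleMeterRegistry`, not this PR's code) of how a zero-duration sample inflates the count and drags down the mean:

```java
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ZeroSampleDemo {
    public static void main(String[] args) {
        final SimpleMeterRegistry registry = new SimpleMeterRegistry();
        final Timer timer = Timer.builder("replication.command.execution").register(registry);

        timer.record(40, TimeUnit.MILLISECONDS); // a phase that actually ran
        timer.record(0, TimeUnit.NANOSECONDS);   // a phase that never started

        System.out.println(timer.count());                     // 2, not 1
        System.out.println(timer.mean(TimeUnit.MILLISECONDS)); // 20.0, not 40.0
    }
}
```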
Summary by CodeRabbit
New Features
Refactor