(fix): cdc watermark updater main #22822

XuPeng-SH · 2025-11-12T01:14:05Z

User description

What type of PR is this?

Which issue(s) this PR fixes:

issue #22718

What this PR does / why we need it:

cdc watermark updater main

PR Type

Bug fix, Enhancement

Description

Add watermark stall detection to CDC table streams with configurable thresholds
Track snapshot progress and emit warnings when watermark fails to advance
Implement retryable error handling when stall threshold exceeded
Add new metrics for monitoring snapshot stalls and table activity
Fix watermark updater to handle missing watermarks gracefully

Diagram Walkthrough

flowchart LR
  A["TableChangeStream"] -->|"detects no progress"| B["handleSnapshotNoProgress"]
  B -->|"increments counter"| C["CdcTableNoProgressCounter"]
  B -->|"sets gauge"| D["CdcTableStuckGauge"]
  B -->|"exceeds threshold"| E["retryable error"]
  F["onWatermarkAdvanced"] -->|"resets state"| G["resetWatermarkStallState"]
  G -->|"clears metrics"| D
  H["WatermarkUpdater"] -->|"handles missing WM"| I["graceful fallback"]

File Walkthrough

Relevant files

Enhancement

table_change_stream.go: `Implement watermark stall detection framework` (pkg/cdc/table_change_stream.go, +159/-27)
  Add watermark stall detection fields to `TableChangeStream` struct
  Implement `TableChangeStreamOption` pattern with configurable thresholds for stall detection and warning intervals
  Add `handleSnapshotNoProgress()` to detect and report snapshot timestamp stalls with throttled warnings
  Add `resetWatermarkStallState()` to clear stall tracking when progress resumes
  Add `onWatermarkAdvanced()` callback to update metrics and reset stall state on successful watermark advancement
  Initialize stall detection metrics in stream constructor
  Integrate stall detection into `processWithTxn()` workflow

cdc_metrics.go: `Add snapshot stall counter metric` (pkg/util/metric/v2/cdc_metrics.go, +10/-0)
  Add `CdcTableNoProgressCounter` metric to track snapshot stall occurrences per table
  Register new counter metric in `initCDCMetrics()`

Tests

table_change_stream_test.go: `Add comprehensive stall detection tests` (pkg/cdc/table_change_stream_test.go, +153/-1)
  Add test helpers `readGaugeValue()` and `readCounterValue()` for metric assertions
  Add `TestTableChangeStream_HandleSnapshotNoProgress_WarningAndReset` to verify warning emission and metric reset
  Add `TestTableChangeStream_HandleSnapshotNoProgress_ThresholdExceeded` to verify error on stall threshold breach
  Add `TestTableChangeStream_HandleSnapshotNoProgress_WarningThrottle` to verify warning throttling behavior
  Add `TestTableChangeStream_HandleSnapshotNoProgress_Defaults` to verify default configuration values
  Update `createTestStream()` helper to accept optional configuration parameters

watermark_updater_test.go: `Update watermark updater tests` (pkg/cdc/watermark_updater_test.go, +24/-2)
  Update `TestCDCWatermarkUpdater_UpdateWatermarkErrMsg` to expect success instead of error
  Add `TestCDCWatermarkUpdater_RemoveThenUpdateErrMsg` to verify graceful handling after watermark removal

Bug fix

watermark_updater.go: `Fix watermark updater error handling` (pkg/cdc/watermark_updater.go, +13/-2)
  Fix `onJobs()` to gracefully handle `ErrNoWatermarkFound` in `JT_CDC_UpdateWMErrMsg` case
  Initialize missing watermark entries in `readKeysBuffer` instead of failing
  Add cache population logic in `execReadWM()` to persist successfully read watermarks
  Import `errors` package for error type checking

Documentation

CDC_USER_GUIDE.md: `Document snapshot stall detection feature` (pkg/cdc/CDC_USER_GUIDE.md, +35/-0)
  Add "Detect Snapshot Stalls" section documenting stall detection behavior and metrics
  Document default thresholds (1 minute stall, 10 second warning interval)
  Provide PromQL query examples for monitoring stalls
  Add `mo_cdc_table_snapshot_no_progress_total` to metrics reference table
  Add example queries for stuck tables and activity timestamps
  Add snapshot stall detection to alerting recommendations

qodo-merge-pro · 2025-11-12T01:14:38Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code
	Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting
	Status: Passed
	Generic: Robust Error Handling and Edge Case Management
	Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation
	Status: Passed
	Generic: Secure Logging Practices
	Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data.
	Status: Passed
	Generic: Security-First Input Validation and Data Handling
	Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities
	Status: Passed
⚪	Generic: Comprehensive Audit Trails
	Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance.
	Status: Action logging: New critical behaviors around snapshot stall detection and retryable errors add warnings and metrics but do not clearly log the final error/decision context when the stall threshold is exceeded, which may hinder auditability of who/what triggered retries and why.
	Referred Code:
	    if s.lastNoProgressWarning.IsZero() || now.Sub(s.lastNoProgressWarning) >= s.noProgressWarningInterval {
	        logutil.Warn(
	            "cdc.table_stream.snapshot_not_advanced",
	            zap.String("table", s.tableInfo.String()),
	            zap.String("from-ts", fromTs.ToString()),
	            zap.String("snapshot-ts", snapshotTs.ToString()),
	            zap.Duration("stall-duration", stalledFor),
	            zap.Duration("threshold", s.watermarkStallThreshold),
	        )
	        s.lastNoProgressWarning = now
	    }
	    if stalledFor >= s.watermarkStallThreshold {
	        s.retryable = true
	        return moerr.NewInternalErrorf(
	            ctx,
	            "CDC tableChangeStream %s snapshot timestamp stuck for %v (threshold %v)",
	            s.tableInfo.String(),
	            stalledFor,
	            s.watermarkStallThreshold,
	    ... (clipped 2 lines)
	Generic: Secure Error Handling
	Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging.
	Status: Error detail exposure: Warning and error messages include table identifiers via s.tableInfo.String(), which could expose internal resource names if surfaced to end users rather than internal logs.
	Referred Code:
	    logutil.Warn(
	        "cdc.table_stream.snapshot_not_advanced",
	        zap.String("table", s.tableInfo.String()),
	        zap.String("from-ts", fromTs.ToString()),
	        zap.String("snapshot-ts", snapshotTs.ToString()),
	        zap.Duration("stall-duration", stalledFor),
	        zap.Duration("threshold", s.watermarkStallThreshold),
	    )
	    s.lastNoProgressWarning = now

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-merge-pro · 2025-11-12T01:15:33Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Suggestion: Fix incorrect metric update logic
Impact: Medium

Remove the incorrect update to the `mo_cdc_table_last_activity_timestamp` metric within the `handleSnapshotNoProgress` function to ensure it only reflects the time of successful progress.

pkg/cdc/table_change_stream.go [794-804]

      func (s *TableChangeStream) handleSnapshotNoProgress(ctx context.Context, fromTs, snapshotTs types.TS) error {
          now := time.Now()
          tableLabel := s.progressTracker.tableKey()
          s.progressTracker.RecordRetry()
          v2.CdcTableNoProgressCounter.WithLabelValues(tableLabel).Inc()
    -     v2.CdcTableLastActivityTimestamp.WithLabelValues(tableLabel).Set(float64(now.Unix()))
          if s.noProgressSince.IsZero() {
              s.noProgressSince = now
          }
      ...

`[To ensure code accuracy, apply this suggestion manually]`

Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies a logical flaw where a metric for last successful activity is updated upon failure, which would make monitoring and alerting on this metric unreliable.

Category: Possible issue
Suggestion: Prevent data race on shared map
Impact: Medium

Prevent a potential data race by adding a lock around the modification of `u.readKeysBuffer` when handling a `JT_CDC_UpdateWMErrMsg` job.

pkg/cdc/watermark_updater.go [423-442]

       case JT_CDC_UpdateWMErrMsg:
           if _, err := u.GetFromCache(context.Background(), job.Key); err != nil {
    --         job.DoneWithErr(err)
    --         continue
    -+         if !errors.Is(err, ErrNoWatermarkFound) {
    -+             job.DoneWithErr(err)
    -+             continue
    -+         }
    -+         if _, exists := u.readKeysBuffer[job.Key]; !exists {
    -+             u.readKeysBuffer[job.Key] = WatermarkResult{}
    -+         }
    +          if !errors.Is(err, ErrNoWatermarkFound) {
    +              job.DoneWithErr(err)
    +              continue
    +          }
    +          u.Lock()
    +          if _, exists := u.readKeysBuffer[job.Key]; !exists {
    +              u.readKeysBuffer[job.Key] = WatermarkResult{}
    +          }
    +          u.Unlock()
           }
           u.committingErrMsgBuffer = append(u.committingErrMsgBuffer, job)

`[To ensure code accuracy, apply this suggestion manually]`

Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies a potential data race on the `readKeysBuffer` map due to concurrent access without a lock, which could lead to unpredictable behavior or crashes.

XuPeng-SH added 3 commits November 12, 2025 09:12

update

48c0a51

update

fa8f27a

update

7dce742

XuPeng-SH requested review from aptend, ck89119, jiangxinmeng1 and zhangxu19830126 as code owners November 12, 2025 01:14

XuPeng-SH temporarily deployed to ci November 12, 2025 01:14 — with GitHub Actions Inactive

matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Nov 12, 2025

qodo-merge-pro bot added the Review effort 3/5 label Nov 12, 2025

mergify bot added the kind/bug Something isn't working label Nov 12, 2025

ck89119 approved these changes Nov 12, 2025
