TiFlash: add missing TiFlash Grafana metrics #21427

Open
hfxsd wants to merge 1 commit into pingcap:master from hfxsd:add-missing-tiflash-metrics

@hfxsd
Collaborator

@hfxsd hfxsd commented Mar 11, 2026

Expands the TiFlash monitoring doc by adding many new metrics and sections across the Grafana dashboards, and clarifies that the TiFlash proxy/Raft metrics overlap heavily with those of TiKV. Entries were added or renamed (for example, Read Index OPS -> Raft Read Index OPS and Wait Index Duration -> Raft Wait Index Duration), and Write & Delta Management Total was introduced. New sections include Imbalance read/write, Memory trace, Storage Read Pool & Data Sharing, PageStorage, Rate Limiter, Raft Snapshot / IngestSST, Disaggregated-Write/Compute, S3, Pipeline Model, TiFlash Resource Control, Status Server, and Vector Search. TiFlash-Proxy-Summary and TiFlash-Proxy-Details are also expanded extensively (cluster, errors, server, thread CPU, PD, raft IO/process/message/propose/admin, unified read pool, storage, scheduler, snapshot, task, threads, RocksDB, encryption, etc.). These additions improve coverage and clarity for TiFlash cluster monitoring.

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

@ti-chi-bot

ti-chi-bot bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign qiancai for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hfxsd hfxsd self-assigned this Mar 11, 2026
@ti-chi-bot ti-chi-bot bot added the following labels on Mar 11, 2026: contribution (This PR is from a community contributor.), missing-translation-status (This PR does not have translation status info.), size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
@hfxsd hfxsd requested review from 3pointer and niubell March 11, 2026 03:30
@hfxsd hfxsd added the translation/doing label (This PR's assignee is translating this PR.) and removed the missing-translation-status label (This PR does not have translation status info.) on Mar 11, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 11, 2026

@hfxsd: The following test failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-verify | ecef140 | link | true | /test pull-verify |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@xzhangxian1008
Contributor

/assign

Comment on lines +32 to +55
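- Region: The number of Regions held by each TiFlash instance.
- IO Throughput: The I/O throughput of each TiFlash instance.
- Threads CPU: The CPU usage of each thread.
- SST Import Service: Metrics related to the SST import service.
- SST Apply: Metrics related to SST apply.
- Region Task: Region task statistics.
- Region Worker: Region worker thread statistics.
- Raft Store: Raft Store status and statistics.
- Apply Worker: Apply worker statistics.
- Storage Background (Small Tasks): Statistics for small background tasks in the storage layer.
- Storage Background (Large Tasks): Statistics for large background tasks in the storage layer.
- Manual Compaction: Manual compaction task statistics.
- GRPC Async Server: gRPC async server statistics.
- GRPC Async Client: gRPC async client statistics.
- FAP builder: FAP build statistics.
- Snapshot Sender: Snapshot sending statistics.
- Segment Scheduler: Segment scheduler statistics.
- Local Index Pool: Local index pool statistics.
- Segment Reader: Segment Reader statistics.
- Threads: Thread count statistics.
- Threads state: The distribution of thread states.
- Threads IO: Thread I/O statistics.
- Thread Voluntary Context Switches: The number of voluntary context switches of threads.
- Thread Nonvoluntary Context Switches: The number of nonvoluntary context switches of threads.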
Suggested change
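- Region: The number of Regions held by each TiFlash instance.
- IO Throughput: The I/O throughput of each TiFlash instance.
- Threads CPU: The CPU usage of each thread.
- SST Import Service: Metrics related to the SST import service.
- SST Apply: Metrics related to SST apply.
- Region Task: Region task statistics.
- Region Worker: Region worker thread statistics.
- Raft Store: Raft Store status and statistics.
- Apply Worker: Apply worker statistics.
- Storage Background (Small Tasks): Statistics for small background tasks in the storage layer.
- Storage Background (Large Tasks): Statistics for large background tasks in the storage layer.
- Manual Compaction: Manual compaction task statistics.
- GRPC Async Server: gRPC async server statistics.
- GRPC Async Client: gRPC async client statistics.
- FAP builder: FAP build statistics.
- Snapshot Sender: Snapshot sending statistics.
- Segment Scheduler: Segment scheduler statistics.
- Local Index Pool: Local index pool statistics.
- Segment Reader: Segment Reader statistics.
- Threads: Thread count statistics.
- Threads state: The distribution of thread states.
- Threads IO: Thread I/O statistics.
- Thread Voluntary Context Switches: The number of voluntary context switches of threads.
- Thread Nonvoluntary Context Switches: The number of nonvoluntary context switches of threads.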

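- Internal Tasks Duration: The time consumed by all TiFlash instances performing internal data sorting tasks.
- Page GC Tasks OPM: The number of times per minute that all TiFlash instances perform Delta data sorting tasks.
- Page GC Tasks Duration: The distribution of time consumed by all TiFlash instances performing Delta data sorting tasks.
- FSync Status: fsync status statistics.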
Suggested change
- FSync Status: fsync status statistics.

Labels

contribution This PR is from a community contributor. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. translation/doing This PR’s assignee is translating this PR.

2 participants