Skip to content

dxf: make max concurrent task configurable#69645

Open
D3Hunter wants to merge 2 commits into
pingcap:masterfrom
D3Hunter:codex/dxf-max-concurrent-formal
Open

dxf: make max concurrent task configurable#69645
D3Hunter wants to merge 2 commits into
pingcap:masterfrom
D3Hunter:codex/dxf-max-concurrent-formal

Conversation

@D3Hunter

@D3Hunter D3Hunter commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: ref #61702

Problem Summary:

DXF service runs in the SYSTEM keyspace for nextgen clusters and shares a resource pool across user keyspaces. The framework-wide max concurrent task limit was a fixed in-memory package variable, which made the limit hard to tune when many small tasks were blocked by the default scheduler concurrency.

What changed and how does it work?

This PR makes the DXF max concurrent task limit atomically configurable in memory:

  • Replaces the plain proto.MaxConcurrentTask global with atomic-backed helpers.
  • Keeps the default and lower bound at 16, caps runtime updates at 1000, and keeps the upper cap available for scheduler buffer sizing.
  • Adds GET /dxf/schedule/max_concurrent_task and POST /dxf/schedule/max_concurrent_task?value=<n> under the existing SYSTEM-keyspace DXF HTTP route block.
  • Returns the current value with persistence: "memory_only" so callers know the update is not persisted.
  • Updates scheduler and storage paths to read the limit through atomic loads.
  • Documents the process-local tuning contract and the finishCh capacity invariant.
  • Keeps test-only override support for existing low-limit scheduler tests and avoids discard-only restore calls in storage tests.

The value is intentionally not persisted in this first version; it only changes the in-memory value of the TiDB process receiving the HTTP request. Operators should send the request to the current DXF owner when tuning scheduler concurrency.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Manual test details:

  • go test -run TestMaxConcurrentTask -tags=intest,deadlock -count=1 ./pkg/dxf/framework/proto
  • ./tools/check/failpoint-go-test.sh pkg/server/handler/tests -run 'TestDXFAPI/max.*concurrent.*task.*api' -count=1 -tags=intest,deadlock,nextgen
  • ./tools/check/failpoint-go-test.sh pkg/dxf/framework/storage -run TestGetTopUnfinishedTasks -count=1
  • ./tools/check/failpoint-go-test.sh pkg/dxf/framework/scheduler -run '^$' -count=1
  • make lint
  • git diff --check
  • git diff --cached --check
  • make bazel_prepare was not required: no Go files were added, removed, renamed, or moved; no Go import sections changed; no Bazel or module files changed; and no new top-level Go test function was added.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Add `/dxf/schedule/max_concurrent_task` to configure the process-local, in-memory DXF max concurrent task limit.

Summary by CodeRabbit

  • New Features

    • Added an API to view and update the task concurrency limit at runtime.
    • Exposed a new HTTP route for managing the maximum number of concurrent tasks.
  • Bug Fixes

    • Task scheduling now respects the current concurrency setting consistently across scheduler and storage operations.
    • Improved handling of task limit values with validation for allowed ranges.
  • Tests

    • Expanded coverage for the new concurrency settings and API behavior.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 3, 2026
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Replaces the exported proto.MaxConcurrentTask variable with an atomic, range-validated configuration exposing GetMaxConcurrentTask, SetMaxConcurrentTask, and a test helper. Updates scheduler and storage code to use the new getter, adds an HTTP endpoint to view/update the limit, and migrates tests/benchmarks accordingly.

Changes

Max Concurrent Task Configuration

Layer / File(s) Summary
Atomic config API and constants
pkg/dxf/framework/proto/task.go, pkg/dxf/framework/proto/task_test.go, pkg/dxf/framework/proto/BUILD.bazel
Introduces atomic.Int64-backed storage with DefaultMaxConcurrentTask, MinMaxConcurrentTask, MaxMaxConcurrentTask constants, GetMaxConcurrentTask, SetMaxConcurrentTask (range-validated), and SetMaxConcurrentTaskForTest, replacing the removed MaxConcurrentTask variable, with new unit test and shard count bump.
Scheduler and storage usage of dynamic limit
pkg/dxf/framework/scheduler/scheduler_manager.go, pkg/dxf/framework/scheduler/interface.go, pkg/dxf/framework/storage/task_table.go
Concurrency checks and query limits in scheduler and storage code now call proto.GetMaxConcurrentTask(); finishCh buffer sized via proto.MaxMaxConcurrentTask.
Max concurrent task API endpoint
pkg/server/handler/tikvhandler/dxf.go, pkg/server/http_status.go, pkg/server/handler/tests/dxf_test.go
Adds DXFTaskMaxConcurrentHandler supporting GET/POST on /dxf/task/max_concurrent to read/update the limit, registers the route, and adds an API test covering validation and update behavior.
Test and benchmark migration
pkg/dxf/framework/integrationtests/bench_test.go, pkg/dxf/framework/scheduler/scheduler_manager_nokit_test.go, pkg/dxf/framework/scheduler/scheduler_test.go, pkg/dxf/framework/storage/table_test.go, tests/realtikvtest/addindextest1/disttask_test.go
Replaces direct assignment/restore of proto.MaxConcurrentTask with proto.SetMaxConcurrentTaskForTest, using its returned restore function or t.Cleanup.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant HTTPServer
  participant DXFTaskMaxConcurrentHandler
  participant proto

  Client->>HTTPServer: GET/POST /dxf/task/max_concurrent
  HTTPServer->>DXFTaskMaxConcurrentHandler: ServeHTTP(req)
  alt POST with value
    DXFTaskMaxConcurrentHandler->>proto: SetMaxConcurrentTask(value)
    proto-->>DXFTaskMaxConcurrentHandler: ok or range error
  else GET
    DXFTaskMaxConcurrentHandler->>proto: GetMaxConcurrentTask()
    proto-->>DXFTaskMaxConcurrentHandler: current value
  end
  DXFTaskMaxConcurrentHandler-->>Client: JSON {max_concurrent_task}
Loading

Suggested labels: approved, lgtm, ok-to-test

Suggested reviewers: joechenrh, wjhuang2016, GMHDBJD

Poem

A rabbit hopped through atomic gates,
No more fixed sixteen, now the code negotiates,
GET to peek, POST to change,
Sixteen to a thousand — quite a range!
🐇⚙️ Concurrency dances, tests restore,
Hop, hop, hooray for one clean core!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title is concise and accurately summarizes the main change: making the DXF max concurrent task configurable.
Description check ✅ Passed The description follows the template and includes the issue, problem summary, change summary, tests, side effects, docs, and release note.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/dxf/framework/proto/task.go (1)

97-104: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Document the validation-bypass and parallel-test caveat.

SetMaxConcurrentTaskForTest intentionally skips the [Min, Max] range check SetMaxConcurrentTask enforces (callers pass values like 1, 3, 4, 6 below MinMaxConcurrentTask). It also mutates process-global state, so like t.Setenv/t.Chdir it isn't safe to use from parallel tests. Neither point is captured in the doc comment.

📝 Suggested doc comment
-// SetMaxConcurrentTaskForTest updates the max concurrency of task and returns a restore function.
+// SetMaxConcurrentTaskForTest updates the max concurrency of task for test purposes,
+// bypassing the [MinMaxConcurrentTask, MaxMaxConcurrentTask] validation done by
+// SetMaxConcurrentTask, and returns a function that restores the previous value.
+// Since this mutates process-global state, it must not be used from parallel tests.
 func SetMaxConcurrentTaskForTest(value int) func() {

As per coding guidelines, "Comments SHOULD explain non-obvious intent, constraints, invariants, concurrency guarantees, SQL/compatibility contracts, or important performance trade-offs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/dxf/framework/proto/task.go` around lines 97 - 104, Update the doc
comment on SetMaxConcurrentTaskForTest to mention that it intentionally bypasses
the MinMaxConcurrentTask validation enforced by SetMaxConcurrentTask and that it
mutates process-global state, so it must not be used from parallel tests. Keep
the note close to the function name so callers understand the validation-bypass
and concurrency caveat when using SetMaxConcurrentTaskForTest.

Source: Coding guidelines

pkg/dxf/framework/scheduler/scheduler_manager.go (1)

158-158: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider a brief comment explaining why finishCh is sized to MaxMaxConcurrentTask.

Since the channel capacity is fixed at creation time but GetMaxConcurrentTask() can later be raised at runtime (up to MaxMaxConcurrentTask), sizing to the static upper bound rather than the current/default value is the correct choice to avoid blocking on future increases. This rationale isn't obvious from the code alone.

📝 Suggested comment
-		finishCh: make(chan struct{}, proto.MaxMaxConcurrentTask),
+		// sized to the upper bound since GetMaxConcurrentTask() can be raised
+		// at runtime after this channel is created.
+		finishCh: make(chan struct{}, proto.MaxMaxConcurrentTask),

As per coding guidelines, "Comments SHOULD explain non-obvious intent, constraints, invariants, concurrency guarantees... SHOULD NOT restate what the code already makes clear."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/dxf/framework/scheduler/scheduler_manager.go` at line 158, Add a brief
explanatory comment at the finishCh initialization in scheduler_manager.go to
capture the non-obvious intent: it is sized to proto.MaxMaxConcurrentTask
because GetMaxConcurrentTask() may be increased later at runtime, so the channel
must be preallocated to the highest possible concurrency to avoid blocking. Keep
the comment near the schedulerManager construction / finishCh field assignment
so the rationale is tied to the finishCh capacity choice.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/dxf/framework/proto/task.go`:
- Around line 97-104: Update the doc comment on SetMaxConcurrentTaskForTest to
mention that it intentionally bypasses the MinMaxConcurrentTask validation
enforced by SetMaxConcurrentTask and that it mutates process-global state, so it
must not be used from parallel tests. Keep the note close to the function name
so callers understand the validation-bypass and concurrency caveat when using
SetMaxConcurrentTaskForTest.

In `@pkg/dxf/framework/scheduler/scheduler_manager.go`:
- Line 158: Add a brief explanatory comment at the finishCh initialization in
scheduler_manager.go to capture the non-obvious intent: it is sized to
proto.MaxMaxConcurrentTask because GetMaxConcurrentTask() may be increased later
at runtime, so the channel must be preallocated to the highest possible
concurrency to avoid blocking. Keep the comment near the schedulerManager
construction / finishCh field assignment so the rationale is tied to the
finishCh capacity choice.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 43b68bc5-91bc-430a-874c-a824f0c18867

📥 Commits

Reviewing files that changed from the base of the PR and between 0ac299d and 8650f6f.

📒 Files selected for processing (14)
  • pkg/dxf/framework/integrationtests/bench_test.go
  • pkg/dxf/framework/proto/BUILD.bazel
  • pkg/dxf/framework/proto/task.go
  • pkg/dxf/framework/proto/task_test.go
  • pkg/dxf/framework/scheduler/interface.go
  • pkg/dxf/framework/scheduler/scheduler_manager.go
  • pkg/dxf/framework/scheduler/scheduler_manager_nokit_test.go
  • pkg/dxf/framework/scheduler/scheduler_test.go
  • pkg/dxf/framework/storage/table_test.go
  • pkg/dxf/framework/storage/task_table.go
  • pkg/server/handler/tests/dxf_test.go
  • pkg/server/handler/tikvhandler/dxf.go
  • pkg/server/http_status.go
  • tests/realtikvtest/addindextest1/disttask_test.go

@D3Hunter D3Hunter changed the base branch from release-nextgen-202603 to master July 3, 2026 10:16
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 3, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown

@D3Hunter: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-ddlv1 8650f6f link true /test pull-unit-test-ddlv1
pull-lightning-integration-test 8650f6f link true /test pull-lightning-integration-test
pull-br-integration-test 8650f6f link true /test pull-br-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@D3Hunter D3Hunter force-pushed the codex/dxf-max-concurrent-formal branch from 8650f6f to ab0568e Compare July 3, 2026 10:44
@ti-chi-bot

ti-chi-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wjhuang2016 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 3, 2026
@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 20.37037% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.0981%. Comparing base (8be4bd0) to head (e6d7326).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #69645        +/-   ##
================================================
- Coverage   76.3268%   74.0981%   -2.2288%     
================================================
  Files          2041       2050         +9     
  Lines        560589     576634     +16045     
================================================
- Hits         427880     427275       -605     
- Misses       131808     149151     +17343     
+ Partials        901        208       -693     
Flag Coverage Δ
integration 40.7274% <20.3703%> (+1.0222%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4471% <ø> (ø)
parser ∅ <ø> (∅)
br 47.4650% <ø> (-15.2563%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ingress-bot

Copy link
Copy Markdown

🔍 Starting code review for this PR...

@ingress-bot ingress-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 8
  • Inline comments: 8
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🟡 [Minor] (4)

  1. DefaultMaxConcurrentTask and MinMaxConcurrentTask are co-equal independent literals with no explicit link, creating silent drift when either changes (pkg/dxf/framework/proto/task.go:89, pkg/dxf/framework/proto/task.go:91, pkg/dxf/framework/proto/task.go:98)
  2. MaxMaxConcurrentTask and MinMaxConcurrentTask use double-compound naming (pkg/dxf/framework/proto/task.go:91, pkg/dxf/framework/proto/task.go:93, pkg/dxf/framework/scheduler/scheduler_manager.go:158)
  3. SetMaxConcurrentTaskForTest is exported from a non-test file with no compile-time restriction to test code (pkg/dxf/framework/proto/task.go:118)
  4. max_concurrent knob is process-local and ephemeral, unlike the persisted sibling DXF tune endpoint (pkg/server/handler/tikvhandler/dxf.go:368, pkg/server/http_status.go:258, pkg/dxf/framework/proto/task.go:120)

🧹 [Nit] (4)

  1. finishCh buffer size change to MaxMaxConcurrentTask lacks an invariant comment (pkg/dxf/framework/scheduler/scheduler_manager.go:158)
  2. Three mid-test calls to SetMaxConcurrentTaskForTest silently discard the restore function, using it as a plain setter (pkg/dxf/framework/storage/table_test.go:553, pkg/dxf/framework/storage/table_test.go:559, pkg/dxf/framework/storage/table_test.go:564)
  3. DXFTaskMaxConcurrentHandler.ServeHTTP missing doc comment and persistence contract (pkg/server/handler/tikvhandler/dxf.go:355, pkg/server/http_status.go:419)
  4. Original TODO 'remove this limit later' was silently dropped without noting disposition (pkg/dxf/framework/proto/task.go:87)

Comment thread pkg/dxf/framework/proto/task.go Outdated
Comment thread pkg/dxf/framework/proto/task.go Outdated
Comment thread pkg/dxf/framework/proto/task.go
Comment thread pkg/server/handler/tikvhandler/dxf.go
Comment thread pkg/dxf/framework/scheduler/scheduler_manager.go Outdated
Comment thread pkg/dxf/framework/storage/table_test.go Outdated
Comment thread pkg/server/handler/tikvhandler/dxf.go
Comment thread pkg/dxf/framework/proto/task.go
@D3Hunter D3Hunter changed the title dxf, server: make max concurrent task configurable dxf: make max concurrent task configurable Jul 3, 2026
@D3Hunter

D3Hunter commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

/cherry-pick release-nextgen-202603

@ti-chi-bot

Copy link
Copy Markdown
Member

@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-202603 in the new PR and assign it to you.

Details

In response to this:

/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants