
[opt](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)#62278

Open
mortalBibo wants to merge 1 commit into apache:master from mortalBibo:fix/olap-cancel-s3-io

Conversation

@mortalBibo mortalBibo commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #62230

Related PR: #59476

Problem Summary:

In storage-compute separation mode, when a query is cancelled or times out during cold reads from S3/OSS, scanner threads continue executing the page-by-page S3 read chain without detecting cancellation. Although the scanner scheduling layer checks _should_stop between get_block() calls, a single get_block() may read hundreds of pages from S3, each as a synchronous GET request, before returning control to the cancellation checkpoint.

This results in unnecessary resource consumption after query cancellation: S3 bandwidth, scanner thread pool occupation, and network connections are wasted on data that will be discarded.

Note on cluster freeze: The cluster-wide freeze originally described in #62230 was resolved by increasing pipeline_executor_size. The root cause of the freeze is likely related to segment footer parsing blocking pipeline executor threads during cold reads, which PR #59476 may have optimized from the source. This PR addresses the separate concern of reducing post-cancellation resource waste at the S3 I/O layer.

Two blocking boundaries lacked cancellation checks:

  1. S3FileReader::read_at_impl(): The io_ctx parameter was commented out (const IOContext* /*io_ctx*/), so the S3 429 retry loop sleeps with exponential backoff (~7s total) without ever checking for cancellation.
  2. CachedRemoteFileReader::read_at_impl(): The FileBlock wait loop blocks up to ~10s (10 rounds x 1s) waiting for cache population without checking for cancellation.
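The check added at these boundaries follows a common pattern: test the cancellation flag before each retry/backoff round rather than sleeping through all of them. The sketch below is illustrative only, with stand-in types (the real doris::RuntimeState and doris::io::IOContext are far richer):

```cpp
#include <atomic>

// Stand-ins for doris::RuntimeState and doris::io::IOContext (illustrative only).
struct RuntimeState {
    std::atomic<bool> cancelled{false};
    bool is_cancelled() const { return cancelled.load(); }
};

struct IOContext {
    const RuntimeState* runtime_state = nullptr;  // non-owning, may be null
};

enum class Status { OK, Cancelled, TooManyRetries };

// Simplified retry loop: before each backoff round, bail out if the query
// that owns this read has been cancelled.
Status read_with_retry(const IOContext* io_ctx, int max_retries,
                       bool (*do_get)(), void (*backoff)(int round)) {
    for (int round = 0; round < max_retries; ++round) {
        if (io_ctx != nullptr && io_ctx->runtime_state != nullptr &&
            io_ctx->runtime_state->is_cancelled()) {
            return Status::Cancelled;  // detected at the retry boundary
        }
        if (do_get()) {
            return Status::OK;  // GET (or cache wait) succeeded
        }
        backoff(round);  // exponential backoff in the real S3 reader
    }
    return Status::TooManyRetries;
}
```

Note the null checks: the pointer is optional, so paths that never set runtime_state (e.g. compaction) behave exactly as before.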

Design choice

Approach chosen: Inject RuntimeState* into IOContext

Add a RuntimeState* pointer field to IOContext, set it at the two places where IOContext is initialized for query data reads, then check RuntimeState::is_cancelled() at both blocking boundaries.

IOContext propagation chain (already exists, no new plumbing needed):

BetaRowsetReader::get_segment_iterators()   <- IOContext created here, runtime_state set here
  _read_options.io_ctx (StorageReadOptions, VALUE copy)
    -> SegmentIterator._opts (VALUE copy, pointer members preserved)
      -> ColumnIterator (VALUE copy)
        -> PageIO::read_page(&opts.io_ctx)  (POINTER)
          -> S3FileReader::read_at_impl(io_ctx*)      <- cancellation check added here
          -> CachedRemoteFileReader::read_at_impl(io_ctx*)  <- cancellation check added here

RuntimeState* is a non-owning pointer that follows the same pattern as the three existing raw pointers in IOContext (query_id, file_cache_stats, file_reader_stats). In the async prefetch_range() path, it is explicitly cleared to nullptr — following the existing cleanup pattern for the other pointer fields — to prevent use-after-free when the prefetch task outlives the query lifecycle.
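The nullptr-clearing for the async path can be sketched as follows (hypothetical names; the real prefetch_range() code differs): before handing an IOContext copy to a task that may outlive the query, drop the non-owning pointer so the detached task can never dereference a destroyed RuntimeState.

```cpp
// Stand-ins for the real types (illustrative only).
struct RuntimeState {
    bool is_cancelled() const { return false; }
};

struct IOContext {
    const RuntimeState* runtime_state = nullptr;  // non-owning
};

// A detached prefetch task must not keep non-owning pointers that can
// dangle once the query (and its RuntimeState) is torn down.
IOContext make_prefetch_ctx(const IOContext& query_ctx) {
    IOContext ctx = query_ctx;    // value copy, like the real chain
    ctx.runtime_state = nullptr;  // cleared: task may outlive the query
    return ctx;
}
```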

Alternatives considered

Why RuntimeState* instead of IOContext::should_stop (the FileScanner approach):

The existing should_stop flag in IOContext is set by FileScanner::try_stop(), which is only called for external table scanners. For internal tables, OlapScanner does not override try_stop(). More critically, IOContext is value-copied through the chain (BetaRowsetReader -> SegmentIterator -> ColumnIterator), so even if should_stop were set somewhere upstream, the value-copied IOContext in the iterator would not see the update. RuntimeState* avoids this because the pointer is copied by value but still dereferences to the same shared state.
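The value-copy pitfall can be shown in a few lines (a generic illustration, not Doris code): a plain flag member is frozen at copy time and never sees later updates, while a pointer member, though itself copied by value, still dereferences the shared state.

```cpp
#include <atomic>

// Generic illustration of the value-copy pitfall described above.
struct SharedState {
    std::atomic<bool> cancelled{false};
};

struct Ctx {
    bool should_stop = false;           // plain flag: frozen at copy time
    const SharedState* state = nullptr; // pointer: the copy still sees updates
};

// Simulates IOContext being value-copied down the iterator chain
// (BetaRowsetReader -> SegmentIterator -> ColumnIterator).
Ctx copy_down_the_chain(const Ctx& upstream) {
    return upstream;  // member-wise value copy
}

bool sees_cancellation(const Ctx& downstream) {
    return downstream.should_stop ||
           (downstream.state != nullptr && downstream.state->cancelled.load());
}
```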

| | should_stop (FileScanner) | RuntimeState* (this PR) |
| --- | --- | --- |
| Thread safety | Requires atomic or external sync | Already thread-safe (_exec_status + _query_ctx->is_cancelled()) |
| Cancellation sources | Only try_stop() caller | All sources: user KILL, query timeout, OOM |
| Plumbing needed | Override try_stop() in OlapScanner, propagate through IOContext chain | Set pointer at IOContext init site only |
| Intermediate layer changes | ReaderParams, RowsetReaderContext, TabletReader | None |

Why not weak_ptr<QueryContext>:

Using weak_ptr<QueryContext> would provide lifecycle-safe cancellation checking without manual pointer cleanup in async paths. However, this would introduce std::weak_ptr (and <memory>) into IOContext, changing the struct's copy semantics in a high-frequency value-copy path. It was deemed unnecessarily divergent from the existing IOContext pattern, which uses raw pointers exclusively.

Why not weak_ptr<ResourceContext> + TaskController::is_cancelled():

ResourceContext and TaskController are Doris's existing task-level cancellation abstractions. This approach is architecturally superior — better abstraction boundaries, natural async safety, and a clear extension path toward unified interrupt semantics. However, it requires adding <memory> to io_common.h, creating a new io_common.cpp, and diverges from the existing raw-pointer pattern. For a targeted optimization, consistency with existing patterns was prioritized over local elegance.

Why not query_id + FragmentMgr::get_query_ctx() lookup:

IOContext already carries query_id, so one might look up QueryContext on the fly. However, FragmentMgr::cancel_query() calls remove_query_context(query_id) after cancellation, so the lookup would fail precisely at the moment cancellation needs to be detected — making this approach unreliable.

Why not a shared atomic<bool> cancellation token:

A shared cancellation token would unify cancel and should_stop semantics across all scanner types. This is the long-term ideal but requires modifying all IOContext creation points, all should_stop usage sites (FileScanner, Parquet, ORC readers), and scanner initialization flows. This is a follow-up architectural improvement, not suitable for a targeted optimization.

Changes

| File | Change | Purpose |
| --- | --- | --- |
| be/src/io/io_common.h | Add forward decl + RuntimeState* runtime_state field | Carry cancellation context through I/O chain |
| be/src/storage/rowset/beta_rowset_reader.cpp | Set io_ctx.runtime_state | Inject into data scan IOContext |
| be/src/exec/scan/olap_scanner.cpp | Set io_ctx.runtime_state | Inject into inverted index statistics collection IOContext |
| be/src/io/fs/s3_file_reader.cpp | Enable io_ctx param + add cancellation check | Check before each S3 retry attempt |
| be/src/io/cache/cached_remote_file_reader.cpp | Add cancellation check in wait loop; forward io_ctx on cache-miss fallback read; clear in prefetch | Check before each cache wait round; ensure fallback S3 reads see cancellation; prevent UAF in async prefetch |
| be/src/storage/index/inverted/inverted_index_fs_directory.cpp | Copy runtime_state in setIoContext() | Propagate through inverted index IOContext manual-copy site |

Scope and known limitations

In scope: The data scan path of OlapScanner — the page-by-page read loop that drives bulk S3/OSS I/O. After query cancellation, this optimization allows scanner threads to detect cancellation at the next I/O boundary rather than continuing to read all remaining pages.

Not in scope (pre-existing, not introduced by this PR):

  1. In-flight synchronous S3 GetObject() requests — The AWS SDK GetObject() call is synchronous and cannot be interrupted. Cancellation is detected at the next retry/wait boundary, not mid-request.

  2. Segment metadata reads during Segment::open() (footer parse, column meta reads via ColumnMetaAccessor / ExternalColMetaUtil). These paths construct IOContext locally without access to RuntimeState. They are structurally different from data scan reads:

    • One-shot: 2-3 S3 GETs per segment (vs. thousands for data scan)
    • Cached: SegmentLoader LRU cache (per-BE, cross-query) + StoragePageCache + weak_ptr — three tiers of caching
    • Protected: DorisCallOnce ensures metadata init runs at most once per Segment instance
    • Bounded: worst case ~7s per segment, does not grow with data volume

    Threading RuntimeState* through would require modifying Segment::open() — a fundamental storage API with a 6-layer call chain used by compaction, schema change, clone, and query paths (15+ call sites). The cost-benefit ratio is prohibitive.

  3. Broader stop-propagation semantics (e.g., FileScanner::try_stop() for external table scanners).

Release note

None

Check List (For Author)

  • Test
    • [ ] Regression test
    • [ ] Unit Test
    • [x] Manual test (add detailed scripts or steps below)
    • [ ] No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

Manual test steps:

  1. Set up a Doris cluster in storage-compute separation mode with S3 backend
  2. Create a partitioned table with historical data spanning several years
  3. Execute a large cold scan query: SELECT sum(col) FROM table WHERE dt >= '2021-01-01'
  4. Cancel the query with KILL QUERY or wait for timeout
  5. Verify scanner threads release promptly and subsequent queries execute normally
  • Behavior changed:

    • [ ] No.
    • [x] Yes. After query cancel/timeout, scanner threads now detect cancellation at S3 retry and cache-wait boundaries instead of continuing to read all remaining pages. Only affects the cancel/timeout path; normal query execution is unchanged.
  • Does this need documentation?

    • [x] No.
    • [ ] Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…apache#62230)

### What problem does this PR solve?

Issue Number: close apache#62230

Related PR: apache#59476

Problem Summary:
In storage-compute separation mode, when a query is cancelled or times
out during cold reads from S3/OSS, scanner threads continue executing
the page-by-page S3 read chain without detecting cancellation. Although
the scanner scheduling layer checks _should_stop between get_block()
calls, a single get_block() may read hundreds of pages from S3, each
as a synchronous GET request, before returning control to the
cancellation check point.

This results in unnecessary resource consumption after query
cancellation: S3 bandwidth, scanner thread pool occupation, and
network connections are wasted on data that will be discarded.

Note: The cluster-wide freeze originally described in this issue was
resolved by increasing pipeline_executor_size. The root cause of the
freeze is likely related to segment footer parsing blocking pipeline
executor threads during cold reads, which PR apache#59476 may have optimized
from the source. This PR addresses the separate concern of reducing
post-cancellation resource waste.

Two blocking boundaries lacked cancellation checks:
1. S3FileReader::read_at_impl() — io_ctx parameter was commented out,
   so the 429 retry loop never checks cancellation.
2. CachedRemoteFileReader::read_at_impl() — FileBlock wait loop blocks
   up to ~10s without checking cancellation.

Fix:
Add a non-owning RuntimeState* pointer to IOContext and check
RuntimeState::is_cancelled() at both blocking boundaries. Set the
pointer at BetaRowsetReader (main data scan path) and OlapScanner's
inverted index statistics collection path. Propagate through the
inverted index IOContext manual-copy site. Clear in async prefetch
to prevent use-after-free. Forward io_ctx on the cache-miss remote
fallback read path.

Design choice:
RuntimeState* follows the existing raw-pointer convention in IOContext
(query_id, file_cache_stats, file_reader_stats). Alternatives using
weak_ptr or a unified cancellation token were considered but rejected
to keep this change minimal and consistent with existing patterns.
See PR description for detailed alternatives analysis.

### Known limitations

1. Cannot interrupt an in-flight synchronous S3 GetObject() request;
   cancellation is detected at the next retry/wait boundary.
2. Segment metadata reads during Segment::open() are not covered.
   These are structurally different: one-shot, triple-cached
   (SegmentLoader LRU + StoragePageCache + weak_ptr), and bounded
   (~7s max). They cannot cause thread pool exhaustion.

### Release note

None

### Check List (For Author)

- Test
    - [x] Manual test
- Behavior changed:
    - [x] Yes (cancel/timeout path only; normal execution unchanged)
- Does this need documentation?
    - [x] No
mortalBibo force-pushed the fix/olap-cancel-s3-io branch from 486e537 to 3ac67a2 on April 9, 2026 13:36
mortalBibo changed the title from "[fix](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)" to "[opt](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)" on Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] OlapScanner does not propagate query cancellation to S3 I/O, causing scanner thread pool exhaustion
