
[opt](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)#62278

Open
mortalBibo wants to merge 1 commit into apache:master from mortalBibo:fix/olap-cancel-s3-io

Conversation

@mortalBibo mortalBibo commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #62230

Related PR: #59476

Problem Summary:

In storage-compute separation mode, when a query is cancelled or times out during cold reads from S3/OSS, scanner threads continue executing the page-by-page S3 read chain without detecting cancellation. Although the scanner scheduling layer checks _should_stop between get_block() calls, a single get_block() may read hundreds of pages from S3, each as a synchronous GET request, before returning control to the cancellation checkpoint.

This results in unnecessary resource consumption after query cancellation: S3 bandwidth, scanner thread pool occupation, and network connections are wasted on data that will be discarded.

Note on cluster freeze: The cluster-wide freeze originally described in #62230 was resolved by increasing pipeline_executor_size. The root cause of the freeze is likely related to segment footer parsing blocking pipeline executor threads during cold reads, which PR #59476 may have optimized from the source. This PR addresses the separate concern of reducing post-cancellation resource waste at the S3 I/O layer.

Two blocking boundaries lacked cancellation checks:

  1. S3FileReader::read_at_impl(): The io_ctx parameter was commented out (const IOContext* /*io_ctx*/), so the S3 429 retry loop sleeps with exponential backoff (~7s total) without ever checking for cancellation.
  2. CachedRemoteFileReader::read_at_impl(): The FileBlock wait loop blocks up to ~10s (10 rounds x 1s) waiting for cache population without checking for cancellation.
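The check added at these boundaries follows a common pattern: test the cancellation flag before each retry/backoff round rather than sleeping through all of them. The sketch below is illustrative only, with stand-in types (the real doris::RuntimeState and doris::io::IOContext are far richer):

```cpp
#include <atomic>

// Stand-ins for doris::RuntimeState and doris::io::IOContext (illustrative only).
struct RuntimeState {
    std::atomic<bool> cancelled{false};
    bool is_cancelled() const { return cancelled.load(); }
};

struct IOContext {
    const RuntimeState* runtime_state = nullptr;  // non-owning, may be null
};

enum class Status { OK, Cancelled, TooManyRetries };

// Simplified retry loop: before each backoff round, bail out if the query
// that owns this read has been cancelled.
Status read_with_retry(const IOContext* io_ctx, int max_retries,
                       bool (*do_get)(), void (*backoff)(int round)) {
    for (int round = 0; round < max_retries; ++round) {
        if (io_ctx != nullptr && io_ctx->runtime_state != nullptr &&
            io_ctx->runtime_state->is_cancelled()) {
            return Status::Cancelled;  // detected at the retry boundary
        }
        if (do_get()) {
            return Status::OK;  // GET (or cache wait) succeeded
        }
        backoff(round);  // exponential backoff in the real S3 reader
    }
    return Status::TooManyRetries;
}
```

Note the null checks: the pointer is optional, so paths that never set runtime_state (e.g. compaction) behave exactly as before.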

Design choice

Approach chosen: Inject RuntimeState* into IOContext

Add a RuntimeState* pointer field to IOContext, set it at the two places where IOContext is initialized for query data reads, then check RuntimeState::is_cancelled() at both blocking boundaries.

IOContext propagation chain (already exists, no new plumbing needed):

BetaRowsetReader::get_segment_iterators()   <- IOContext created here, runtime_state set here
  _read_options.io_ctx (StorageReadOptions, VALUE copy)
    -> SegmentIterator._opts (VALUE copy, pointer members preserved)
      -> ColumnIterator (VALUE copy)
        -> PageIO::read_page(&opts.io_ctx)  (POINTER)
          -> S3FileReader::read_at_impl(io_ctx*)      <- cancellation check added here
          -> CachedRemoteFileReader::read_at_impl(io_ctx*)  <- cancellation check added here

RuntimeState* is a non-owning pointer that follows the same pattern as the three existing raw pointers in IOContext (query_id, file_cache_stats, file_reader_stats). In the async prefetch_range() path, it is explicitly cleared to nullptr — following the existing cleanup pattern for the other pointer fields — to prevent use-after-free when the prefetch task outlives the query lifecycle.
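The nullptr-clearing for the async path can be sketched as follows (hypothetical names; the real prefetch_range() code differs): before handing an IOContext copy to a task that may outlive the query, drop the non-owning pointer so the detached task can never dereference a destroyed RuntimeState.

```cpp
// Stand-ins for the real types (illustrative only).
struct RuntimeState {
    bool is_cancelled() const { return false; }
};

struct IOContext {
    const RuntimeState* runtime_state = nullptr;  // non-owning
};

// A detached prefetch task must not keep non-owning pointers that can
// dangle once the query (and its RuntimeState) is torn down.
IOContext make_prefetch_ctx(const IOContext& query_ctx) {
    IOContext ctx = query_ctx;    // value copy, like the real chain
    ctx.runtime_state = nullptr;  // cleared: task may outlive the query
    return ctx;
}
```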

Alternatives considered

Why RuntimeState* instead of IOContext::should_stop (the FileScanner approach):

The existing should_stop flag in IOContext is set by FileScanner::try_stop(), which is only called for external table scanners. For internal tables, OlapScanner does not override try_stop(). More critically, IOContext is value-copied through the chain (BetaRowsetReader -> SegmentIterator -> ColumnIterator), so even if should_stop were set somewhere upstream, the value-copied IOContext in the iterator would not see the update. RuntimeState* avoids this because the pointer is copied by value but still dereferences to the same shared state.
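The value-copy pitfall can be shown in a few lines (a generic illustration, not Doris code): a plain flag member is frozen at copy time and never sees later updates, while a pointer member, though itself copied by value, still dereferences the shared state.

```cpp
#include <atomic>

// Generic illustration of the value-copy pitfall described above.
struct SharedState {
    std::atomic<bool> cancelled{false};
};

struct Ctx {
    bool should_stop = false;           // plain flag: frozen at copy time
    const SharedState* state = nullptr; // pointer: the copy still sees updates
};

// Simulates IOContext being value-copied down the iterator chain
// (BetaRowsetReader -> SegmentIterator -> ColumnIterator).
Ctx copy_down_the_chain(const Ctx& upstream) {
    return upstream;  // member-wise value copy
}

bool sees_cancellation(const Ctx& downstream) {
    return downstream.should_stop ||
           (downstream.state != nullptr && downstream.state->cancelled.load());
}
```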

| | should_stop (FileScanner) | RuntimeState* (this PR) |
| --- | --- | --- |
| Thread safety | Requires atomic or external sync | Already thread-safe (_exec_status + _query_ctx->is_cancelled()) |
| Cancellation sources | Only try_stop() caller | All sources: user KILL, query timeout, OOM |
| Plumbing needed | Override try_stop() in OlapScanner, propagate through IOContext chain | Set pointer at IOContext init site only |
| Intermediate layer changes | ReaderParams, RowsetReaderContext, TabletReader | None |

Why not weak_ptr<QueryContext>:

Using weak_ptr<QueryContext> would provide lifecycle-safe cancellation checking without manual pointer cleanup in async paths. However, this would introduce std::weak_ptr (and <memory>) into IOContext, changing the struct's copy semantics in a high-frequency value-copy path. It was deemed unnecessarily divergent from the existing IOContext pattern, which uses raw pointers exclusively.

Why not weak_ptr<ResourceContext> + TaskController::is_cancelled():

ResourceContext and TaskController are Doris's existing task-level cancellation abstractions. This approach is architecturally superior — better abstraction boundaries, natural async safety, and a clear extension path toward unified interrupt semantics. However, it requires adding <memory> to io_common.h, creating a new io_common.cpp, and diverges from the existing raw-pointer pattern. For a targeted optimization, consistency with existing patterns was prioritized over local elegance.

Why not query_id + FragmentMgr::get_query_ctx() lookup:

IOContext already carries query_id, so one might look up QueryContext on the fly. However, FragmentMgr::cancel_query() calls remove_query_context(query_id) after cancellation, so the lookup would fail precisely at the moment cancellation needs to be detected — making this approach unreliable.

Why not a shared atomic<bool> cancellation token:

A shared cancellation token would unify cancel and should_stop semantics across all scanner types. This is the long-term ideal but requires modifying all IOContext creation points, all should_stop usage sites (FileScanner, Parquet, ORC readers), and scanner initialization flows. This is a follow-up architectural improvement, not suitable for a targeted optimization.

Changes

| File | Change | Purpose |
| --- | --- | --- |
| be/src/io/io_common.h | Add forward decl + RuntimeState* runtime_state field | Carry cancellation context through I/O chain |
| be/src/storage/rowset/beta_rowset_reader.cpp | Set io_ctx.runtime_state | Inject into data scan IOContext |
| be/src/exec/scan/olap_scanner.cpp | Set io_ctx.runtime_state | Inject into inverted index statistics collection IOContext |
| be/src/io/fs/s3_file_reader.cpp | Enable io_ctx param + add cancellation check | Check before each S3 retry attempt |
| be/src/io/cache/cached_remote_file_reader.cpp | Add cancellation check in wait loop; forward io_ctx on cache-miss fallback read; clear in prefetch | Check before each cache wait round; ensure fallback S3 reads see cancellation; prevent UAF in async prefetch |
| be/src/storage/index/inverted/inverted_index_fs_directory.cpp | Copy runtime_state in setIoContext() | Propagate through inverted index IOContext manual-copy site |

Scope and known limitations

In scope: The data scan path of OlapScanner — the page-by-page read loop that drives bulk S3/OSS I/O. After query cancellation, this optimization allows scanner threads to detect cancellation at the next I/O boundary rather than continuing to read all remaining pages.

Not in scope (pre-existing, not introduced by this PR):

  1. In-flight synchronous S3 GetObject() requests — The AWS SDK GetObject() call is synchronous and cannot be interrupted. Cancellation is detected at the next retry/wait boundary, not mid-request.

  2. Segment metadata reads during Segment::open() (footer parse, column meta reads via ColumnMetaAccessor / ExternalColMetaUtil). These paths construct IOContext locally without access to RuntimeState. They are structurally different from data scan reads:

    • One-shot: 2-3 S3 GETs per segment (vs. thousands for data scan)
    • Cached: SegmentLoader LRU cache (per-BE, cross-query) + StoragePageCache + weak_ptr — three tiers of caching
    • Protected: DorisCallOnce ensures metadata init runs at most once per Segment instance
    • Bounded: worst case ~7s per segment, does not grow with data volume

    Threading RuntimeState* through would require modifying Segment::open() — a fundamental storage API with a 6-layer call chain used by compaction, schema change, clone, and query paths (15+ call sites). The cost-benefit ratio is prohibitive.

  3. Broader stop-propagation semantics (e.g., FileScanner::try_stop() for external table scanners).

Release note

None

Check List (For Author)

  • Test
    • [ ] Regression test
    • [ ] Unit Test
    • [x] Manual test (add detailed scripts or steps below)
    • [ ] No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

Manual test steps:

  1. Set up a Doris cluster in storage-compute separation mode with S3 backend
  2. Create a partitioned table with historical data spanning several years
  3. Execute a large cold scan query: SELECT sum(col) FROM table WHERE dt >= '2021-01-01'
  4. Cancel the query with KILL QUERY or wait for timeout
  5. Verify scanner threads release promptly and subsequent queries execute normally
  • Behavior changed:

    • [ ] No.
    • [x] Yes. After query cancel/timeout, scanner threads now detect cancellation at S3 retry and cache-wait boundaries instead of continuing to read all remaining pages. Only affects the cancel/timeout path; normal query execution is unchanged.
  • Does this need documentation?

    • [x] No.
    • [ ] Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…apache#62230)

### What problem does this PR solve?

Issue Number: close apache#62230

Related PR: apache#59476

Problem Summary:
In storage-compute separation mode, when a query is cancelled or times
out during cold reads from S3/OSS, scanner threads continue executing
the page-by-page S3 read chain without detecting cancellation. Although
the scanner scheduling layer checks _should_stop between get_block()
calls, a single get_block() may read hundreds of pages from S3, each
as a synchronous GET request, before returning control to the
cancellation check point.

This results in unnecessary resource consumption after query
cancellation: S3 bandwidth, scanner thread pool occupation, and
network connections are wasted on data that will be discarded.

Note: The cluster-wide freeze originally described in this issue was
resolved by increasing pipeline_executor_size. The root cause of the
freeze is likely related to segment footer parsing blocking pipeline
executor threads during cold reads, which PR apache#59476 may have optimized
from the source. This PR addresses the separate concern of reducing
post-cancellation resource waste.

Two blocking boundaries lacked cancellation checks:
1. S3FileReader::read_at_impl() — io_ctx parameter was commented out,
   so the 429 retry loop never checks cancellation.
2. CachedRemoteFileReader::read_at_impl() — FileBlock wait loop blocks
   up to ~10s without checking cancellation.

Fix:
Add a non-owning RuntimeState* pointer to IOContext and check
RuntimeState::is_cancelled() at both blocking boundaries. Set the
pointer at BetaRowsetReader (main data scan path) and OlapScanner's
inverted index statistics collection path. Propagate through the
inverted index IOContext manual-copy site. Clear in async prefetch
to prevent use-after-free. Forward io_ctx on the cache-miss remote
fallback read path.

Design choice:
RuntimeState* follows the existing raw-pointer convention in IOContext
(query_id, file_cache_stats, file_reader_stats). Alternatives using
weak_ptr or a unified cancellation token were considered but rejected
to keep this change minimal and consistent with existing patterns.
See PR description for detailed alternatives analysis.

### Known limitations

1. Cannot interrupt an in-flight synchronous S3 GetObject() request;
   cancellation is detected at the next retry/wait boundary.
2. Segment metadata reads during Segment::open() are not covered.
   These are structurally different: one-shot, triple-cached
   (SegmentLoader LRU + StoragePageCache + weak_ptr), and bounded
   (~7s max). They cannot cause thread pool exhaustion.

### Release note

None

### Check List (For Author)

- Test
    - [x] Manual test
- Behavior changed:
    - [x] Yes (cancel/timeout path only; normal execution unchanged)
- Does this need documentation?
    - [x] No
mortalBibo force-pushed the fix/olap-cancel-s3-io branch from 486e537 to 3ac67a2 on April 9, 2026 13:36
mortalBibo changed the title from "[fix](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)" to "[opt](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230)" on Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] OlapScanner does not propagate query cancellation to S3 I/O, causing scanner thread pool exhaustion
