[opt](storage) Propagate query cancellation to S3 I/O for OlapScanner (#62230) #62278
Open
mortalBibo wants to merge 1 commit into apache:master from
Conversation
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
What problem does this PR solve?
Issue Number: close #62230
Related PR: #59476
Problem Summary:
In storage-compute separation mode, when a query is cancelled or times out during cold reads from S3/OSS, scanner threads continue executing the page-by-page S3 read chain without detecting cancellation. Although the scanner scheduling layer checks `_should_stop` between `get_block()` calls, a single `get_block()` may read hundreds of pages from S3 — each as a synchronous GET request — before returning control to the cancellation check point.

This results in unnecessary resource consumption after query cancellation: S3 bandwidth, scanner thread pool occupation, and network connections are wasted on data that will be discarded.
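The granularity problem can be pictured with a toy model (illustrative names only, not Doris's real scanner scheduler API): the stop flag is consulted only between blocks, so a cancel that arrives mid-block still pays for every page GET in that block.

```cpp
#include <atomic>

// Toy model of the check granularity described above; the real code is
// Doris's scanner scheduling layer and OlapScanner.
struct ToyScanner {
    std::atomic<bool> should_stop{false};
    int pages_read = 0;

    // One get_block() call reads many pages; there is no cancellation
    // check inside this inner loop.
    void get_block(int pages_per_block) {
        for (int i = 0; i < pages_per_block; ++i) {
            ++pages_read; // stands in for one synchronous S3 GET
        }
    }

    // The scheduling layer checks should_stop only BETWEEN get_block()
    // calls, so cancellation is detected at block granularity.
    void run(int blocks, int pages_per_block) {
        for (int b = 0; b < blocks; ++b) {
            if (should_stop.load()) return;
            get_block(pages_per_block);
        }
    }
};
```

In this model, setting `should_stop` before `run()` prevents all reads, but setting it while `get_block()` is executing saves nothing until the current block finishes; this PR moves the check down into the I/O boundaries.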
Note on cluster freeze: The cluster-wide freeze originally described in #62230 was resolved by increasing `pipeline_executor_size`. The root cause of the freeze is likely related to segment footer parsing blocking pipeline executor threads during cold reads, which PR #59476 may have optimized at the source. This PR addresses the separate concern of reducing post-cancellation resource waste at the S3 I/O layer.

Two blocking boundaries lacked cancellation checks:

1. `S3FileReader::read_at_impl()`: The `io_ctx` parameter was commented out (`const IOContext* /*io_ctx*/`), so the S3 429 retry loop sleeps with exponential backoff (~7s total) without ever checking for cancellation.
2. `CachedRemoteFileReader::read_at_impl()`: The FileBlock wait loop blocks up to ~10s (10 rounds x 1s) waiting for cache population without checking for cancellation.
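A minimal sketch of the fix at the first boundary, using hypothetical stand-ins for `RuntimeState` and `IOContext` (the real types live in Doris's `be/src` tree): the retry loop checks cancellation between attempts instead of sleeping through the whole backoff.

```cpp
#include <atomic>
#include <string>

// Hypothetical minimal stand-ins; not Doris's exact definitions.
struct RuntimeState {
    std::atomic<bool> cancelled{false};
    bool is_cancelled() const { return cancelled.load(); }
};
struct IOContext {
    const RuntimeState* runtime_state = nullptr; // non-owning, may be null
};

// Sketch of a 429 retry loop that bails out between attempts once the
// query is cancelled. Each iteration stands in for one synchronous GET.
std::string read_with_retry(const IOContext* io_ctx, int max_retries) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        if (io_ctx && io_ctx->runtime_state &&
            io_ctx->runtime_state->is_cancelled()) {
            return "CANCELLED"; // detected at the retry boundary
        }
        // ... issue the GET; on HTTP 429, sleep with exponential
        // backoff and fall through to the next iteration ...
    }
    return "OK";
}
```

Both null checks matter: `io_ctx` may be absent on non-query paths, and `runtime_state` stays `nullptr` wherever it was never set (or was deliberately cleared for async prefetch).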
Design choice

Approach chosen: inject `RuntimeState*` into `IOContext`.

Add a `RuntimeState*` pointer field to `IOContext`, set it at the two places where `IOContext` is initialized for query data reads, then check `RuntimeState::is_cancelled()` at both blocking boundaries. The `IOContext` propagation chain already exists, so no new plumbing is needed.
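Why a value-copied `IOContext` still observes cancellation through the pointer, while a value-copied flag would not, can be shown in a few lines (toy types with illustrative names):

```cpp
#include <atomic>

// Shared query-lifetime state (stand-in for RuntimeState).
struct SharedState {
    std::atomic<bool> cancelled{false};
};

// Two toy context variants, both copied by value down the chain
// (BetaRowsetReader -> SegmentIterator -> ColumnIterator).
struct FlagCtx { bool should_stop = false; };     // flag is copied
struct PtrCtx  { SharedState* state = nullptr; }; // pointer is copied

bool flag_sees_cancel() {
    FlagCtx upstream;
    FlagCtx iterator_copy = upstream; // value copy taken before cancel
    upstream.should_stop = true;      // upstream flips only its own copy
    return iterator_copy.should_stop; // stale: still false
}

bool ptr_sees_cancel(SharedState* shared) {
    PtrCtx upstream{shared};
    PtrCtx iterator_copy = upstream;  // pointer itself copied by value
    shared->cancelled = true;         // cancel the shared state
    return iterator_copy.state->cancelled.load(); // true: same object
}
```

This is the core of the argument against `IOContext::should_stop` below: copying a pointer copies the address, not the state it points at.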
`RuntimeState*` is a non-owning pointer that follows the same pattern as the three existing raw pointers in `IOContext` (`query_id`, `file_cache_stats`, `file_reader_stats`). In the async `prefetch_range()` path, it is explicitly cleared to `nullptr` — following the existing cleanup pattern for the other pointer fields — to prevent use-after-free when the prefetch task outlives the query lifecycle.
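The async cleanup step can be sketched as follows (hypothetical minimal `IOContext`; field names mirror the convention described above, not Doris's exact layout):

```cpp
// Stand-in types; the real ones live in Doris's be/src/io and be/src/runtime.
struct RuntimeState {};
struct FileCacheStatistics {};

struct IOContext {
    const RuntimeState* runtime_state = nullptr;   // non-owning
    FileCacheStatistics* file_cache_stats = nullptr; // non-owning
};

// Before handing an IOContext to an async prefetch task, take a value
// copy and null the query-scoped pointers: the prefetch may outlive the
// query, and dereferencing a dangling pointer would be use-after-free.
IOContext make_prefetch_ctx(const IOContext& src) {
    IOContext detached = src;            // IOContext is value-copied
    detached.runtime_state = nullptr;    // query-scoped: must not dangle
    detached.file_cache_stats = nullptr; // same cleanup as existing fields
    return detached;
}
```

The trade-off is deliberate: the prefetch task loses the ability to detect cancellation, but it can never dereference freed query state.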
Alternatives considered

Why `RuntimeState*` instead of `IOContext::should_stop` (the FileScanner approach):

The existing `should_stop` flag in `IOContext` is set by `FileScanner::try_stop()`, which is only called for external table scanners. For internal tables, `OlapScanner` does not override `try_stop()`. More critically, `IOContext` is value-copied through the chain (BetaRowsetReader -> SegmentIterator -> ColumnIterator), so even if `should_stop` were set somewhere upstream, the value-copied IOContext in the iterator would not see the update. `RuntimeState*` avoids this because the pointer is copied by value but still dereferences to the same shared state: the check goes through `_exec_status` and `_query_ctx->is_cancelled()`, and where `should_stop` requires a `try_stop()` caller that `OlapScanner` lacks, the `RuntimeState*` signal propagates through the existing IOContext chain.

Why not `weak_ptr<QueryContext>`:

Using `weak_ptr<QueryContext>` would provide lifecycle-safe cancellation checking without manual pointer cleanup in async paths. However, this would introduce `std::weak_ptr` (and `<memory>`) into `IOContext`, changing the struct's copy semantics in a high-frequency value-copy path. It was deemed unnecessarily divergent from the existing IOContext pattern, which uses raw pointers exclusively.

Why not `weak_ptr<ResourceContext>` + `TaskController::is_cancelled()`:

`ResourceContext` and `TaskController` are Doris's existing task-level cancellation abstractions. This approach is architecturally superior — better abstraction boundaries, natural async safety, and a clear extension path toward unified interrupt semantics. However, it requires adding `<memory>` to `io_common.h`, creating a new `io_common.cpp`, and diverges from the existing raw-pointer pattern. For a targeted optimization, consistency with existing patterns was prioritized over local elegance.

Why not `query_id` + `FragmentMgr::get_query_ctx()` lookup:

`IOContext` already carries `query_id`, so one might look up `QueryContext` on the fly. However, `FragmentMgr::cancel_query()` calls `remove_query_context(query_id)` after cancellation, so the lookup would fail precisely at the moment cancellation needs to be detected — making this approach unreliable.

Why not a shared `atomic<bool>` cancellation token:

A shared cancellation token would unify `cancel` and `should_stop` semantics across all scanner types. This is the long-term ideal but requires modifying all `IOContext` creation points, all `should_stop` usage sites (FileScanner, Parquet, ORC readers), and scanner initialization flows. This is a follow-up architectural improvement, not suitable for a targeted optimization.

Changes
- `be/src/io/io_common.h`: add `RuntimeState* runtime_state` field
- `be/src/storage/rowset/beta_rowset_reader.cpp`: set `io_ctx.runtime_state`
- `be/src/exec/scan/olap_scanner.cpp`: set `io_ctx.runtime_state`
- `be/src/io/fs/s3_file_reader.cpp`: uncomment the `io_ctx` param + add cancellation check
- `be/src/io/cache/cached_remote_file_reader.cpp`: forward `io_ctx` on cache-miss fallback read; clear in prefetch
- `be/src/storage/index/inverted/inverted_index_fs_directory.cpp`: propagate `runtime_state` in `setIoContext()`

Scope and known limitations
In scope: the data scan path of `OlapScanner` — the page-by-page read loop that drives bulk S3/OSS I/O. After query cancellation, this optimization allows scanner threads to detect cancellation at the next I/O boundary rather than continuing to read all remaining pages.

Not in scope (pre-existing, not introduced by this PR):
- In-flight synchronous S3 `GetObject()` requests — the AWS SDK `GetObject()` call is synchronous and cannot be interrupted. Cancellation is detected at the next retry/wait boundary, not mid-request.
- Segment metadata reads during `Segment::open()` (footer parse, column meta reads via `ColumnMetaAccessor`/`ExternalColMetaUtil`). These paths construct `IOContext` locally without access to `RuntimeState`. They are structurally different from data scan reads:
  - `SegmentLoader` LRU cache (per-BE, cross-query) + `StoragePageCache` + `weak_ptr` — three tiers of caching
  - `DorisCallOnce` ensures metadata init runs at most once per Segment instance

  Threading `RuntimeState*` through would require modifying `Segment::open()` — a fundamental storage API with a 6-layer call chain used by compaction, schema change, clone, and query paths (15+ call sites). The cost-benefit ratio is prohibitive.
- Broader stop-propagation semantics (e.g., `FileScanner::try_stop()` for external table scanners).

Release note
None
Check List (For Author)

- Test
  - [x] Manual test

  Manual test steps:
  1. Run a cold-read query, e.g. `SELECT sum(col) FROM table WHERE dt >= '2021-01-01'`
  2. `KILL QUERY` or wait for timeout
- Behavior changed:
  - [x] Yes (cancel/timeout path only; normal execution unchanged)
- Does this need documentation?
  - [x] No
Check List (For Reviewer who merge this PR)