
[v25.3.x] Implement cross-segment prefetching for small segments in cloud storage reads #29795

Open
vbotbuildovich wants to merge 4 commits into redpanda-data:v25.3.x from vbotbuildovich:backport-pr-29496-v25.3.x-932

Conversation

@vbotbuildovich
Collaborator

Backport of PR #29496

Add `try_get_segment_units()` which attempts to acquire segment units
without blocking. This is useful for optional segment hydrations from
prefetching or otherwise.

(cherry picked from commit e15fbda)
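
To illustrate what a non-blocking acquisition looks like, here is a minimal single-threaded sketch. The `unit_pool` class and its `try_get_units()` method are hypothetical stand-ins and not Redpanda's actual types or signatures; the sketch only shows the idea that an optional hydration gives up immediately when units are unavailable instead of waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for a pool of segment units (not Redpanda's real type).
// Single-threaded by design: no locking is attempted in this sketch.
class unit_pool {
private:
    std::size_t _available;

public:
    explicit unit_pool(std::size_t total) : _available(total) {}

    // RAII handle: returns its units to the pool when destroyed.
    class units {
    public:
        units(unit_pool& pool, std::size_t n) : _pool(&pool), _count(n) {}
        units(units&& o) noexcept : _pool(o._pool), _count(o._count) {
            o._pool = nullptr;
            o._count = 0;
        }
        units(const units&) = delete;
        units& operator=(const units&) = delete;
        units& operator=(units&&) = delete;
        ~units() {
            if (_pool != nullptr) {
                _pool->_available += _count;
            }
        }
        std::size_t count() const { return _count; }

    private:
        unit_pool* _pool;
        std::size_t _count;
    };

    // Non-blocking: returns std::nullopt right away if fewer than n units
    // are free, so the caller can skip the optional work.
    std::optional<units> try_get_units(std::size_t n) {
        assert(n > 0);
        if (n > _available) {
            return std::nullopt;
        }
        _available -= n;
        return units(*this, n);
    }

    std::size_t available() const { return _available; }
};
```

A prefetcher built on this shape would call `try_get_units()` and simply skip the prefetch when it gets `std::nullopt`, leaving blocking acquisition to the reads that actually need the segment.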

While the existing chunk prefetching works well for larger segments, it
doesn't help when there are many small segments due to compaction or
otherwise. Prior to this commit we'd end up having to read and wait for
each small segment from cloud storage, which greatly reduced throughput.

Now, when reading small segments, the reader can prefetch additional small
segments in parallel (up to `cloud_storage_prefetch_segments_max`) to
increase throughput.

(cherry picked from commit 5f84706)
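
The selection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `segment_meta`, `pick_prefetch_candidates`, the `small_threshold_bytes` cutoff, and the policy of stopping at the first large segment are all hypothetical. Only the cap corresponds to something named in the commit message (`cloud_storage_prefetch_segments_max`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-segment metadata; the real reader tracks far more.
struct segment_meta {
    std::size_t size_bytes;
};

// Returns the indices of segments following `current` that could be
// prefetched in parallel. Assumed policy: only small segments qualify,
// and the scan stops at the first large one.
std::vector<std::size_t> pick_prefetch_candidates(
  const std::vector<segment_meta>& segments,
  std::size_t current,
  std::size_t small_threshold_bytes,
  std::size_t prefetch_segments_max) {
    assert(current < segments.size());
    std::vector<std::size_t> out;

    // Cross-segment prefetch is only considered while reading a small segment.
    if (segments[current].size_bytes >= small_threshold_bytes) {
        return out;
    }

    for (std::size_t i = current + 1;
         i < segments.size() && out.size() < prefetch_segments_max;
         ++i) {
        if (segments[i].size_bytes >= small_threshold_bytes) {
            break; // assumed: larger segments are left to chunk prefetching
        }
        out.push_back(i);
    }
    return out;
}
```

Each returned index would then be hydrated concurrently, ideally using a non-blocking resource check so that prefetching never starves the foreground read.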

When multiple callers (e.g., segment prefetch and reader) concurrently
request hydration for the same segment and the index download fails,
a race condition could cause some callers to receive an unhandled
exception instead of retrying with legacy mode.

The issue occurred because:

1. Multiple callers add promises to the shared `_wait_list`
2. Index download fails, all waiters receive the exception
3. First caller to handle the exception sets `_fallback_mode` and retries
4. Subsequent callers see `_fallback_mode` is already set, causing the
   condition `(ex.path == _index_path && !_fallback_mode)` to be false
5. These callers rethrow the exception instead of retrying

This commit changes the logic of `do_hydrate` to always retry if the
failed download was the index.

(cherry picked from commit 9fd4a11)
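
The difference between the old and new retry conditions can be shown with a small sequential simulation. This is a simplified sketch, not the actual `do_hydrate` code: `download_failure`, `should_retry_before`, `should_retry_after`, `simulate_waiters`, and the `"segment.index"` path are hypothetical names, and real waiters run concurrently rather than in a loop. Only the old predicate `(ex.path == _index_path && !_fallback_mode)` is taken from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical exception payload carrying the path whose download failed.
struct download_failure {
    std::string path;
};

// Old condition from the commit message: retry only if fallback is not yet set.
bool should_retry_before(
  const download_failure& ex, const std::string& index_path, bool fallback_mode) {
    return ex.path == index_path && !fallback_mode;
}

// New condition: always retry when the failed download was the index.
bool should_retry_after(
  const download_failure& ex, const std::string& index_path) {
    return ex.path == index_path;
}

struct outcome {
    int retried;
    int rethrown;
};

// Every waiter receives the same index-download failure and handles it in turn.
outcome simulate_waiters(int waiters, bool use_fix) {
    assert(waiters >= 0);
    const std::string index_path = "segment.index"; // illustrative path
    bool fallback_mode = false;
    outcome result{0, 0};
    for (int i = 0; i < waiters; ++i) {
        download_failure ex{index_path};
        bool retry = use_fix
                       ? should_retry_after(ex, index_path)
                       : should_retry_before(ex, index_path, fallback_mode);
        if (retry) {
            fallback_mode = true; // the first handler flips this flag
            ++result.retried;
        } else {
            ++result.rethrown; // the caller sees an unhandled exception
        }
    }
    return result;
}
```

With three waiters, the old predicate lets only the first one retry while the other two rethrow; with the new predicate all three retry in legacy mode.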
@vbotbuildovich vbotbuildovich requested a review from a team as a code owner March 11, 2026 10:34
@vbotbuildovich vbotbuildovich added this to the v25.3.x-next milestone Mar 11, 2026
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Mar 11, 2026
@vbotbuildovich
Collaborator Author

Retry command for Build#81615

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shard_placement_test.py::ShardPlacementTest.test_upgrade

@vbotbuildovich
Collaborator Author

CI test results

test results on build#81615

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShardPlacementTest | test_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/81615#019cdc93-d6e6-449f-8353-1fcf321eb7ee | FLAKY | 9/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShardPlacementTest&test_method=test_upgrade |


Labels

area/build area/redpanda kind/backport PRs targeting a stable branch
