Skip to content

fix(querier): de-flake store-gateway limits integration tests by waiting for ACTIVE store-gateway in querier ring view#7614

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-sg-databytes-limits-sg-active-wait
Open

fix(querier): de-flake store-gateway limits integration tests by waiting for ACTIVE store-gateway in querier ring view#7614
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-sg-databytes-limits-sg-active-wait

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

De-flakes two integration_querier tests that expect every query to fail with 422 from a store-gateway limit but intermittently get a 500 instead (issue #7606, observed on arm64 CI, run 26832378915):

querier_test.go:564:
        Not equal:
        expected: 422
        actual  : 500

The fix is test-only: after the existing "ring tokens registered + blocks loaded" waits, also wait until the querier's view of the store-gateway ring reports the store-gateway ACTIVE, in:

The wait is the same idiom already used in integration/backward_compatibility_test.go (#5975):

	// Wait until the store-gateway is ACTIVE in the querier's view of the store-gateway ring. The
	// store-gateway registers JOINING (with tokens) and switches to ACTIVE only after the initial
	// blocks sync, so the waits above can pass while the querier would still fail queries with
	// "at least 1 healthy replica required, could only find 0" (HTTP 500) instead of the expected 422.
	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"))))

Root cause (decoded from the failing CI runs)

We decoded the gzipped 500 response bodies from the failing CI runs. All occurrences (#7606 run 26832378915; #7605 runs 26832378915 and 26632776611) are the byte-identical querier-local error:

expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095

produced by the ring lookup (pkg/querier/blocks_store_replicated_set.go:127pkg/ring/replication_strategy.go:93) before any store-gateway RPC. The failing query took ~4ms and has zero store-gateway log lines in its window — it never reached limiter code on the store-gateway.

Why the existing waits don't close the window:

  • The store-gateway registers in the ring as JOINING already owning tokens, and flips to ACTIVE only after its initial blocks sync.
  • The querier's BlocksRead ring operation only selects ACTIVE instances (pkg/storegateway/gateway_ring.go:49).
  • The querier's consul watch is rate-limited (default 1 rps, pkg/ring/kv/consul/client.go:79), so its ring view lags the store-gateway's state flip.

So cortex_ring_tokens_total == 512*2 on the querier and cortex_bucket_store_blocks_loaded == 1 on the store-gateway can all pass while the querier's ring view still says JOINING → the replication-set lookup fails → 500 instead of 422.

Why this wait is deterministic (and not just a bigger timeout): the cortex_ring_members gauge is computed under the ring's write lock from the same ringDesc that Ring.Get consults (pkg/ring/ring.go:380, updateRingMetrics at :662); once a scrape observes ACTIVE == 1, the failing predicate is permanently false (the state cannot regress absent a store-gateway crash). The state="ACTIVE" series is zero-filled in the same locked ring update, and the preceding 1024-token wait proves the querier has already processed a ring update containing the store-gateway — so the bare matcher form (no e2e.WaitMissingMetrics) is safe and matches the #5975 precedent.

Correcting the originally filed hypothesis (falsified)

Issue #7606 was filed with the hypothesis that the store-gateway log line series size exceeded expected size; refetching ... maxSeriesSize=1 near the failure meant the bytes-limit error was re-wrapped on the vendored Thanos refetch path and lost its 422/ResourceExhausted coding, so the querier mapped it to 500. Investigation falsified that hypothesis; this PR corrects the record:

  • The "series size exceeded; refetching" and "exceeded series limit" log lines near the failure belong to earlier, passing tests in the same CI job. The failing query never reached limiter code (see the decoded body above).
  • We additionally audited the suspected 422-loss class and found it unreachable on the pinned vendor: (1) Cortex's limiter already returns a status-coded error (pkg/storegateway/limiter.go:31); (2) all 10 Thanos consumption sites immediately re-code it ResourceExhausted (vendor/github.com/thanos-io/thanos/pkg/store/bucket.go 1256/1408/1417/3091/3168/3226/3403/3430/3743/3848), including the refetch recursion (3459 → 3430) and the lazy-postings limits; (3) Thanos' final conversions preserve the code — status.FromError(errors.Cause(err)) at bucket.go:1795-1798, and the streaming warning path (proxy_merge.goNewWarnSeriesResponseGRPCCodeFromWarn, storepb/custom.go:66-74) always sees the substring rpc error: code = ResourceExhausted embedded by the re-wrap, so its Unknown fallback cannot fire for limiter errors; (4) the querier maps ResourceExhaustedvalidation.LimitError → 422 (pkg/querier/blocks_store_queryable.go:733-748, using grpc-go status.FromError, which unwraps wrapped chains), and even a hypothetical wire Code(422) is preserved by the translator's 4xx branch (pkg/querier/error_translate_queryable.go:57-69); (5) the same CI job shows the 422 path working end-to-end (TestQuerierWithBlocksStorageLimits passed with nested ResourceExhausted statuses moments before the failure).

Why not other approaches

  • Why not an error-code-propagation fix (the original hypothesis)? See the audit above — intrinsic-status limiter redesigns, querier-side string matching, or translator catch-alls would not have prevented this flake (the failing query never carried a limiter error) and add misclassification risk: e.g. grpc-go itself emits ResourceExhausted for "message larger than max", which must stay 5xx, so remapping ResourceExhausted more aggressively in production risks mislabeling genuine server errors as 422.
  • Why not a production fix for the 500 itself? "No ACTIVE store-gateway owns the block" is intended fail-fast behavior: JOINING exists precisely to keep BlocksRead away until the initial blocks sync (pkg/storegateway/gateway_ring.go); RF>1 and query-frontend retries cover the window in production. Retrying/blocking GetClientsFor, admitting JOINING into BlocksRead, or gating querier readiness on ACTIVE store-gateways would mask outages or break bootstrap orderings for what is a test-synchronization gap.
  • Why not sleeps / bigger timeouts / e2e auto-retry? They shrink but don't close the window (the querier's watch lag is unbounded by any fixed sleep) and slow every run. The added wait is the exact predicate Ring.Get evaluates, observed from the same locked snapshot, so it closes the race rather than racing it more slowly.

Scope

  • Test-only: two identical waits in integration/querier_test.go + one CHANGELOG.md line. No production code, no vendored code, no e2e-framework changes, no shared helper.
  • The tests' assertions are untouched: 422 + "exceeded bytes limit" / "exceeded series limit" remain asserted on every query, so the limiter → ResourceExhaustedvalidation.LimitError → 422 path stays covered end-to-end.
  • Same root cause as Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query #7605 (same decoded body class, same CI job): that issue is fixed separately for TestQuerierWithBlocksStorageOnMissingBlocksFromStorage. The two PRs touch disjoint hunks in this file (no shared diff context) and are merge-order independent; this PR deliberately does not touch that test.
  • Documented future work, deliberately not in this PR: an optional vendor-bump "tripwire" pin test of the limiter → warning → GRPCCodeFromWarn string contract. It passes today (there is no current bug), so if maintainers want it, it belongs in a separate tracking issue explicitly labeled not-a-current-bug.

Which issue(s) this PR fixes

Fixes #7606

Checklist

  • Tests updated (the two integration tests themselves; this is a test-only fix, so no new regression test is possible — evidence is the decoded CI failure bodies above)
  • Documentation added — N/A (no flags or config changed; make doc not required)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

Run locally on this branch:

  • gofmt -l integration/querier_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/querier_test.go — clean (no new imports; labels was already imported)
  • go vet -tags "slicelabels,integration,integration_querier" ./integration/ — clean
  • go test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/ — compiles

Docker-based integration runs were not executed in this environment; for reviewers who want to reproduce:

  • Stress: go test -v -tags=integration,requires_docker,integration_querier -timeout 2400s -count=20 ./integration/ -run '^(TestQuerierWithStoreGatewayDataBytesLimits|TestQuerierWithBlocksStorageLimits)$'
  • Race amplification (fail-without-fix demo): add the querier flag -consul.watch-rate-limit=0.2 to these tests — unpatched they hit the 500 near-deterministically; with this PR's wait they pass.

AI assistance disclosure

Per GENAI_POLICY.md: the bulk of this contribution (root-cause investigation, code change, and PR description) was produced with AI assistance (Claude Code). It has been reviewed and validated by the submitter, who takes full responsibility for it.

🤖 Generated with Claude Code

…tore-gateway limits integration tests

TestQuerierWithStoreGatewayDataBytesLimits intermittently fails with
HTTP 500 instead of the expected 422 (cortexproject#7606, arm64 CI). The decoded
(gzipped) 500 response body from the failing run is the querier-local
ring error:

  expanding series: failed to get store-gateway replication set owning
  the block <ULID>: at least 1 healthy replica required, could only
  find 0 - unhealthy instances: 172.18.0.8:9095

i.e. the ring lookup failed before any store-gateway RPC was made. The
store-gateway registers in the ring as JOINING (already owning tokens)
and switches to ACTIVE only after its initial blocks sync, while the
querier's BlocksRead ring operation only selects ACTIVE instances and
its consul watch is rate-limited (1 rps by default). So the existing
waits (ring tokens registered, blocks loaded on the store-gateway) can
all pass while the querier's view of the store-gateway ring still says
JOINING, and the first query 500s.

The hypothesis originally filed on the issue - that the bytes-limit
error loses its 422/ResourceExhausted coding in the vendored Thanos
refetch ("series size exceeded expected size; refetching") path - was
falsified during investigation: those log lines belong to an earlier,
passing test in the same CI job; the failing query never reached
store-gateway limiter code at all; and all 10 vendored limiter
consumption sites (including the refetch recursion) re-code the error
as ResourceExhausted, which the querier maps to a 422 LimitError
(cortexproject#5286).

Fix the race in the tests by waiting until the querier sees the
store-gateway ACTIVE in its store-gateway ring view before querying
(same idiom as backward_compatibility_test.go, cortexproject#5975). Apply the same
wait to the sibling TestQuerierWithBlocksStorageLimits, which has the
identical vulnerable shape (every query expected to hit a 422 limit
against a freshly started store-gateway). Same root cause as cortexproject#7605,
which is fixed separately for
TestQuerierWithBlocksStorageOnMissingBlocksFromStorage in a
non-overlapping PR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/flaky-querier-sg-databytes-limits-sg-active-wait branch from 9fcac35 to 366ffc3 Compare June 11, 2026 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestQuerierWithStoreGatewayDataBytesLimits (integration_querier, arm64) — got 500 instead of expected 422

1 participant