Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query #7605

@sandy2008

Describe the bug

The integration_querier test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage intermittently fails on arm64. The failure is on the first (happy-path) query, which runs before any block is deleted and is expected to succeed:

querier_test.go:367:
    Error: Received unexpected error:
           server_error: server error: 500
    Test:  TestQuerierWithBlocksStorageOnMissingBlocksFromStorage

Root cause (corrected 2026-06-11, from the decoded CI responses): the gzipped 500 response bodies in both CI runs decode to the same querier-local error — the query never reached the store-gateway:

expanding series: failed to get store-gateway replication set owning the block <ULID>:
at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095

Mechanism: the store-gateway registers in the ring in JOINING state with all 512 tokens, runs its initial blocks sync, and only then flips to ACTIVE (pkg/storegateway/gateway.go). Every readiness condition the test waits on (the querier and store-gateway ring token counts, and cortex_bucket_store_blocks_loaded == 1) can be satisfied while the store-gateway is still JOINING in the querier's ring view. The BlocksRead operation, however, selects only ACTIVE instances (pkg/storegateway/gateway_ring.go). Because the querier's consul watch (default rate limit 1/s) lags the JOINING→ACTIVE flip, the first query can run before the querier observes the flip and fail with the error above.
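
The gap between the two checks can be illustrated with a small standalone model. This is only a sketch: the Instance type, selectForRead, and tokensReady are hypothetical stand-ins invented for illustration, not the Cortex ring API. It shows a token-count readiness check passing while an ACTIVE-only selection still finds no usable replica.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified stand-in for a ring instance state.
type State string

const (
	Joining State = "JOINING"
	Active  State = "ACTIVE"
)

// Instance is a hypothetical, minimal ring entry (not the Cortex type).
type Instance struct {
	Addr   string
	State  State
	Tokens int
}

var errNoHealthyReplica = errors.New("at least 1 healthy replica required, could only find 0")

// selectForRead models a read operation that accepts only ACTIVE instances.
func selectForRead(view []Instance) ([]Instance, error) {
	var out []Instance
	for _, in := range view {
		if in.State == Active {
			out = append(out, in)
		}
	}
	if len(out) == 0 {
		return nil, errNoHealthyReplica
	}
	return out, nil
}

// tokensReady models a readiness check that only counts registered tokens,
// regardless of instance state.
func tokensReady(view []Instance, want int) bool {
	total := 0
	for _, in := range view {
		total += in.Tokens
	}
	return total >= want
}

func main() {
	// The querier's (stale) view: tokens are registered, state is still JOINING.
	view := []Instance{{Addr: "172.18.0.8:9095", State: Joining, Tokens: 512}}

	fmt.Println("tokens ready:", tokensReady(view, 512))
	_, err := selectForRead(view)
	fmt.Println("read while JOINING:", err)

	// Once the view catches up with the JOINING→ACTIVE flip, the read succeeds.
	view[0].State = Active
	got, err := selectForRead(view)
	fmt.Println("read while ACTIVE:", len(got), err)
}
```

In this model the token check is true from the start, so a test gated only on it can issue the query during the JOINING window.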

This issue's original description speculated about a ResourceExhausted/series-limit error and a store-gateway readiness race; the series-limit log line turned out to belong to a neighboring (passing) test, and the failure is a test readiness gap, not a production bug.

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex (recent master)
  2. Run the integration test on arm64 (flaky; the window can be widened with a lower -consul.watch-rate-limit):
    go test -tags=slicelabels,integration,integration_querier -count=5 -run TestQuerierWithBlocksStorageOnMissingBlocksFromStorage ./integration/...
    

Expected behavior

The initial query (before block deletion) succeeds deterministically once the querier's own ring view sees the store-gateway as ACTIVE; the test should wait on that condition (the consumer's view) rather than only on producer-side metrics.

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04-arm (arm64), integration job, tag integration_querier
  • Deployment tool: N/A (Docker-based integration test)

Additional Context

Observed in two CI runs, both on arm64 and both on PRs whose diffs are unrelated to the querier.

Fix proposed in #7615 (adds a querier-side cortex_ring_members{name="store-gateway-client",state="ACTIVE"} wait, the same idiom PR #5975 used for the identical race in backward_compatibility_test.go). The same decoded root cause also explains #7606's failure in the same CI job.

Filed and later corrected from CI failure-log analysis with AI assistance; the run links, decoded response bodies, and cited code paths were reviewed and verified against master before submitting.
