Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RW Separation] Block Search Requests when All Search-Only Replicas are unassigned, with proper response #17424

Open
vinaykpud opened this issue Feb 21, 2025 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@vinaykpud
Copy link
Contributor

vinaykpud commented Feb 21, 2025

Describe the bug

When an index has at least one search-only replica (SR) and all search replicas are unassigned, the search request fails with a 503 status and a generic SearchPhaseExecutionException. Instead, the request should be blocked with a clear and valid response explaining the reason for the failure.

Related component

No response

To Reproduce

  1. Create a cluster with 3 nodes.
  2. Create an index with 1 primary shard (1P), 1 replica shard (1R), and 1 search-only replica (1SR).
  3. Index documents into the index.
  4. Perform a search request on the cluster.

The search request fails with the following response:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": []
  },
  "status": 503
}

And the following exception is logged:

org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:775) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:815) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:548) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:290) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:373) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

Expected behavior

The search request should be blocked with a valid response that clearly explains the failure reason.

Ex: Search request failed because all search-only replicas are unassigned state. Also provide hint for user to ensure that search-only replicas are properly assigned to dedicated search nodes.

@mch2
Copy link
Member

mch2 commented Feb 26, 2025

@vinaykpud This is expected behavior - if all the shards capable of performing the search are unassigned it should return an all shards failed. In the case of RW split with search replicas, if there is at least one desired SR and it is unassigned it will only attempt to route to those shards.

However, We do need to cover the case where there are no SRs and someone issues a search. In this case I think we need to introduce an additional flag for strict/lenient enforcement where lenient would fall back to querying a writer. However, the point of separation is to have isolation, and we don't want primaries flooded with requests if that is not intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants