Add FederatedQueryPlanner #2216

Open
wants to merge 59 commits into
base: integration

lbschanno
Collaborator

@lbschanno lbschanno commented Jan 12, 2024

Adds a FederatedQueryPlanner that splits a query into multiple sub-queries, each scanning a subset of the original target date range, when field index holes are identified for the query within that date range.
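
For readers unfamiliar with the idea, here is a minimal sketch of splitting a date range around index holes. It is not the code in this PR: `DateRangeSplitter`, `Range`, and `split` are hypothetical names, the hole list is assumed to be sorted and non-overlapping, and records require Java 16 or later.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the PR.
final class DateRangeSplitter {

    /** A date range with inclusive start and end. */
    record Range(LocalDate start, LocalDate end) {}

    /**
     * Splits the target range into consecutive sub-ranges such that each
     * sub-range lies either entirely inside one hole or entirely outside all holes.
     * Assumes holes are sorted by start date and do not overlap each other.
     */
    static List<Range> split(Range target, List<Range> holes) {
        List<Range> result = new ArrayList<>();
        LocalDate cursor = target.start();
        for (Range hole : holes) {
            // Ignore holes that fall completely outside the remaining target range.
            if (hole.end().isBefore(cursor) || hole.start().isAfter(target.end())) {
                continue;
            }
            // Clamp the hole to the remaining target range.
            LocalDate holeStart = hole.start().isBefore(cursor) ? cursor : hole.start();
            LocalDate holeEnd = hole.end().isAfter(target.end()) ? target.end() : hole.end();
            // Emit the stretch before the hole, if there is one.
            if (cursor.isBefore(holeStart)) {
                result.add(new Range(cursor, holeStart.minusDays(1)));
            }
            // Emit the hole itself as its own sub-range.
            result.add(new Range(holeStart, holeEnd));
            cursor = holeEnd.plusDays(1);
        }
        // Emit whatever remains after the last hole.
        if (!cursor.isAfter(target.end())) {
            result.add(new Range(cursor, target.end()));
        }
        return result;
    }
}
```

Assuming, purely for illustration, a target range of 2015-04-04 through 2015-11-11 and a single-day hole on 2015-10-10, `split` returns three sub-ranges: 2015-04-04 to 2015-10-09, 2015-10-10 to 2015-10-10, and 2015-10-11 to 2015-11-11.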

Note: the work in this PR is dependent on the work in:

Closes #825

lbschanno and others added 9 commits September 19, 2023 04:57
Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed
rows) in the metadata table such that the column qualifier contains the
event date. This is required as a first step to support efforts for
issue #825 so that we can identify dates when an event was ingested and
included in a frequency count for an associated 'f' row, but was not
indexed.
@lbschanno lbschanno changed the title Task/federated query planner Add FederatedQueryPlanner Jan 13, 2024
@ivakegg
Collaborator

ivakegg commented Mar 13, 2024

From @lbschanno

I have been working on getting tests to pass when the FederatedQueryPlanner is the default query planner. Some cases have shown that it may not be enough to simply use the first config returned from a sub-query as the finalized query string.

For instance, in MaxExpansionIndexOnlyQueryTest.testMaxAnyField(), the sub-queries have the following results when the max value expansion threshold is set to 20:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: false
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: (CODE == 'b-code' || CITY == 'b-city' || CITY == 'b-2' || CITY == 'b-1' || STATE == 'b-state') && (CODE == 'a-code' || CITY == 'a-1' || STATE == 'a-state' || STATE == 'a-s2')
Query Data Iterable: contains 3 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: false
Query Data Iterable: Empty

In MaxExpansionIndexOnlyQueryTest.testMaxValueRegexIndexOnly(), we receive the following sub-query results when the max expansion threshold is set to 20:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: CITY == 'a-1' && STATE =~ 'b.*'
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: CITY == 'a-1' && (STATE == 'b3-state' || STATE == 'b-state' || STATE == 'bi-s' || STATE == 'b2-state' || STATE == 'ba-s2')
Query Data Iterable: contains 2 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: CITY == 'a-1' && STATE =~ 'b.*'
Query Data Iterable: Empty

In MaxExpansionIndexOnlyQueryTest.testMaxValueNegAnyField(), we receive the following sub-query results when the max expansion threshold is set to 10:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: false
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: STATE == 'b-state' && !(((Delayed = true) && (ANYFIELD =~ 'a.*')) || CODE == 'a-code' || CITY == 'a-1' || STATE == 'a-state' || STATE == 'a-s2')
Query Data Iterable: contains 4 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: false
Query Data Iterable: Empty

I have seen similar results for MaxExpansionQueryTest and AnyFieldQueryTest. Given that the sub-queries can have differing query strings, how do you want to determine which query string to set in the original config that is passed into the FederatedQueryPlanner.process() method? Do we need the query string specific to the query data iterable returned from each sub-query when setting up the schedulers in ShardQueryLogic.setUpQuery(GenericQueryConfiguration config)? Is there anywhere else where we need to know the specific query string from each sub-query?

I have pushed an update that adds tests to MaxExpansionIndexOnlyQueryTest with versions of each test using either the DefaultQueryPlanner or FederatedQueryPlanner so that you can see the results for yourself.

Another question: do the results above even look correct to you?

@ivakegg
Collaborator

ivakegg commented Mar 15, 2024

For documentation purposes, here is the response in the conversation we had:

  1. The tests that test the query plan would need to be changed to handle the one returned by the federated query planner.
  2. The federated query plan that gets returned should probably concatenate the plans (as a unique set) into something like this:
    ((plan = 1) && ()) || ((plan = 2) && ()) ...
    or simply use the plan itself if only one plan is in the set.
  3. We can work later to allow the query metrics to handle multiple top-level plans after the sub-plan work is pulled in and proven viable.
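
To make point 2 concrete, here is a minimal sketch of one way the concatenation could work. It is an illustration under assumptions, not the implementation in this PR: `FederatedPlanSketch` and `combine` are hypothetical names, the empty parentheses in the suggested format are assumed to hold each sub-plan's query string, and plans are numbered from 1 in first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical illustration only; none of these names come from the PR.
final class FederatedPlanSketch {

    /**
     * Combines sub-query plans into one plan string.
     * Duplicate plans are collapsed, keeping first-seen order. A single
     * unique plan is returned unchanged; otherwise each unique plan is
     * wrapped as ((plan = N) && (subPlan)) and the pieces are OR'ed together.
     */
    static String combine(List<String> subPlans) {
        // LinkedHashSet removes duplicates while preserving insertion order.
        Set<String> unique = new LinkedHashSet<>(subPlans);
        if (unique.isEmpty()) {
            throw new IllegalArgumentException("at least one sub-plan is required");
        }
        if (unique.size() == 1) {
            return unique.iterator().next();
        }
        StringJoiner joined = new StringJoiner(" || ");
        int planNumber = 1;
        for (String plan : unique) {
            joined.add("((plan = " + planNumber + ") && (" + plan + "))");
            planNumber++;
        }
        return joined.toString();
    }
}
```

With the sub-plans from testMaxValueRegexIndexOnly() above, sub-queries 0 and 2 share the plan `CITY == 'a-1' && STATE =~ 'b.*'`, so the combined plan would contain two numbered entries rather than three. When every sub-query yields the same plan, that plan is returned as-is.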

Development

Successfully merging this pull request may close these issues.

Handle the cases were a field is both indexed and not indexed within a time range