Skip to content

ES|QL: Improve field resolution for FORK #128501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 28, 2025

Conversation

ioanatia
Copy link
Contributor

@ioanatia ioanatia commented May 27, 2025

fixes #128271
fixes #128272
related #121950

In #121950 we made some changes for how we do field resolution for FORK.
Previously we would ask for all fields when FORK was used in the query, which was suboptimal.

After I merged #121950, we got some failing tests in IndexResolverFieldNamesTests. That's because prior to me merging the #121950 there were some other fixes made in EsqlSession.fieldNames that were merged and I did not update my branch with the latest changes.

I am just reverting back to requesting all fields for FORK atm.

@ioanatia ioanatia added >non-issue :Analytics/ES|QL AKA ESQL v9.1.0 Team:Search - Relevance The Search organization Search Relevance team labels May 27, 2025
@@ -1994,15 +1994,15 @@ public void testForkFieldsWithKeepAfterFork() {
(WHERE d > 1000 AND e == "aaa" | EVAL c = a + 200)
| WHERE x > y
| KEEP a, b, c, d, x
""", Set.of("a", "a.*", "c", "c.*", "d", "d.*", "e", "e.*", "x", "x.*", "y", "y.*"));
""", ALL_FIELDS);
Copy link
Contributor Author

@ioanatia ioanatia May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically here what we had before was better
what we do now is that if we detect that a FORK branch requires all fields, we don't look further and we end up requiring all fields from field caps.
I think this is fine for now - we can look into improving this later - but at least the current implementation is less error prone than before.

(STATS x = count(*), y=min(z))
| LOOKUP JOIN my_lookup_index ON xyz
| WHERE x > y OR _fork == "fork1"
""", Set.of("a", "c", "abc", "b", "def", "z", "xyz", "def.*", "xyz.*", "z.*", "abc.*", "a.*", "c.*", "b.*"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually got better - because we stopped asking field caps for _fork and b (which was the result of an eval).

@@ -2089,7 +2065,7 @@ public void testForkWithStatsInAllBranches() {
(EVAL z = a * b | STATS m = max(z))
(STATS x = count(*), y=min(z))
| WHERE x > y
""", Set.of("a", "a.*", "b", "b.*", "c", "c.*", "z", "z.*"));
""", Set.of("a", "a.*", "c", "c.*", "z", "z.*"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we correctly detect that now that we don't need to ask for b because it's the result of an eval

@ioanatia ioanatia marked this pull request as ready for review May 27, 2025 11:19
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed Team:Search - Relevance The Search organization Search Relevance team labels May 27, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I have some questions about legibility on the plan traversal

parsed.forEachDown(p -> {// go over each plan top-down
if (hasFork && seenFork.get() == false && p instanceof Fork == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about this forEachDown loop:

  • Shouldn't we check first if we have seenFork? Otherwise makes no sense to step down on the individual plans?
  • Why do we have to check for seenFork in the inner loop?

Maybe adding some comments would help me understand the plan traversal here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check first if we have seenFork? Otherwise makes no sense to step down on the individual plans?

If there is no FORK command, this loop will execute exactly as before. That's why I am checking hasFork first, because the other conditions are relevant only in the context where FORK is being used.

Why do we have to check for seenFork in the inner loop?

This has to do with how FORK is being modelled.
A query like:

FROM test
| WHERE id > 1  // common pre-filter
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content

is parsed as:
Screenshot 2025-05-27 at 14 49 26

If FORK is used we want to analyze the FORK branches separately - which is why we skip the plans in this forEachDown loop until we reach FORK. When we reach FORK we call fieldNames on each child and we try to union the field names.

This forEachDown loop assumes the plan is linear (except for some LookupJoin check).
If someone is making a change to fieldNames, I wanted to make sure they don't have to necessarily worry about FORK and whether the plan is linear or if it contains plans that have more than one child.
But I also didn't want us to introduce bugs with FORK because we did not consider it when we made changes to fieldNames.
Which is why I put the handling of FORK right at the beginning and decided to call fieldNames on each child.

I will add a comment in the code with some explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and I figured out the source of confusion here - we do traverse this with forEachDown, not forEachUp 🤦‍♀️ 🤦‍♀️ 🤦‍♀️

I am just reverting to requesting all fields when FORK and not try to optimize this further which seems so brittle atm

when we are out to tech preview and we have amazing test coverage for FORK we can go back and modify this more confidently.

Comment on lines 620 to 622
if (hasFork && seenFork.get() == false && p instanceof Fork == false) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition here seems to bypass all other plan types that are not fork until a fork is found. This seems brittle. I have the feeling this is like this because of the current limitations in fork but I am not sure it's true. Nevertheless, if fork evolves in the future and has no limitations anymore, this condition here still stands?

The condition makes everything else in the query dependent on fork existence. Maybe some comments in code would explain why this logic makes sense here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is very brittle
I will just revert back to requesting all fields - we can improve this later and I'd argue it's not a must for tech preview

ioanatia and others added 2 commits May 27, 2025 18:04
@ioanatia ioanatia requested review from astefan and carlosdelest May 28, 2025 07:03
Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I think we can simplify some code

@@ -637,6 +616,8 @@ static PreAnalysisResult fieldNames(LogicalPlan parsed, Set<String> enrichPolicy

boolean[] canRemoveAliases = new boolean[] { true };

PreAnalysisResult initialResult = result;
projectAll.set(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - I think this assignment is unnecessary

@@ -711,6 +692,10 @@ static PreAnalysisResult fieldNames(LogicalPlan parsed, Set<String> enrichPolicy
}
});

if (projectAll.get()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't happen, right? I don't see projectAll being updated in the previous forEachDown() block 🤔

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ioanatia ioanatia merged commit f275b71 into elastic:main May 28, 2025
18 checks passed
@ioanatia ioanatia mentioned this pull request May 28, 2025
19 tasks
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0
Projects
None yet
4 participants