ES|QL: Improve field resolution for FORK #128501
Conversation
@@ -1994,15 +1994,15 @@ public void testForkFieldsWithKeepAfterFork() {
          (WHERE d > 1000 AND e == "aaa" | EVAL c = a + 200)
        | WHERE x > y
        | KEEP a, b, c, d, x
-       """, Set.of("a", "a.*", "c", "c.*", "d", "d.*", "e", "e.*", "x", "x.*", "y", "y.*"));
+       """, ALL_FIELDS);
Technically, what we had before was better. What we do now is: if we detect that a FORK branch requires all fields, we don't look further and we end up requiring all fields from field caps.

I think this is fine for now - we can look into improving this later - and at least the current implementation is less error-prone than before.
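The short-circuit described here can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch code: `ALL_FIELDS`, the class, and the method name are hypothetical stand-ins for whatever the real `PreAnalysisResult` machinery uses.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: union the field names required by each FORK branch,
// but if any branch needs all fields, stop and request everything.
class BranchFieldUnion {
    // illustrative sentinel, not the real ALL_FIELDS constant
    static final Set<String> ALL_FIELDS = Set.of("*");

    static Set<String> unionBranchFields(List<Set<String>> branches) {
        Set<String> union = new LinkedHashSet<>();
        for (Set<String> branch : branches) {
            if (branch.equals(ALL_FIELDS)) {
                // one branch requires all fields, so the whole query does
                return ALL_FIELDS;
            }
            union.addAll(branch);
        }
        return union;
    }
}
```

With this shape, a single wildcard-requiring branch makes the entire field-caps request degenerate to all fields, which is exactly the behavior the comment calls out as acceptable for now.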
          (STATS x = count(*), y=min(z))
        | LOOKUP JOIN my_lookup_index ON xyz
        | WHERE x > y OR _fork == "fork1"
        """, Set.of("a", "c", "abc", "b", "def", "z", "xyz", "def.*", "xyz.*", "z.*", "abc.*", "a.*", "c.*", "b.*"));
This actually got better - because we stopped asking field caps for _fork and b (which was the result of an eval).
@@ -2089,7 +2065,7 @@ public void testForkWithStatsInAllBranches() {
          (EVAL z = a * b | STATS m = max(z))
          (STATS x = count(*), y=min(z))
        | WHERE x > y
-       """, Set.of("a", "a.*", "b", "b.*", "c", "c.*", "z", "z.*"));
+       """, Set.of("a", "a.*", "c", "c.*", "z", "z.*"));
We now correctly detect that we don't need to ask for b, because it's the result of an eval.
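The idea behind not asking for eval results can be sketched like this. It is a simplified illustration under my own naming, not the real `EsqlSession.fieldNames` code: names defined by EVAL (or STATS) are outputs of the query, not index fields, so they get subtracted from the set sent to field caps.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: the fields we must request from the index are the
// referenced names minus the names defined locally by EVAL/STATS.
class EvalAliasFilter {
    static Set<String> fieldsToRequest(Set<String> referenced, Set<String> evalDefined) {
        Set<String> needed = new LinkedHashSet<>(referenced);
        // e.g. EVAL b = a + 100 defines b locally, so b is never an index field here
        needed.removeAll(evalDefined);
        return needed;
    }
}
```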
Pinging @elastic/es-analytical-engine (Team:Analytics)
LGTM, though I have some questions about the legibility of the plan traversal.
parsed.forEachDown(p -> { // go over each plan top-down
    if (hasFork && seenFork.get() == false && p instanceof Fork == false) {
I'm a bit confused about this forEachDown loop:

- Shouldn't we check first if we have seenFork? Otherwise it makes no sense to step down into the individual plans.
- Why do we have to check for seenFork in the inner loop?

Maybe adding some comments would help me understand the plan traversal here.
> Shouldn't we check first if we have seenFork? Otherwise makes no sense to step down on the individual plans?

If there is no FORK command, this loop will execute exactly as before. That's why I am checking hasFork first: the other conditions are relevant only in the context where FORK is being used.
> Why do we have to check for seenFork in the inner loop?
This has to do with how FORK is being modelled.
A query like:
```
FROM test
| WHERE id > 1 // common pre-filter
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content
```
If FORK is used, we want to analyze the FORK branches separately, which is why we skip the plans in this forEachDown loop until we reach FORK. When we reach FORK, we call fieldNames on each child and union the field names.

This forEachDown loop assumes the plan is linear (except for some LookupJoin check). If someone makes a change to fieldNames, I wanted to make sure they don't necessarily have to worry about FORK and whether the plan is linear or contains plans with more than one child. But I also didn't want us to introduce bugs with FORK because we did not consider it when making changes to fieldNames. Which is why I put the handling of FORK right at the beginning and decided to call fieldNames on each child.

I will add a comment in the code with some explanation.
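The traversal described here can be modeled in miniature. This is a heavily simplified, hypothetical sketch: `Plan`, `isFork`, and this `fieldNames` are stand-ins I invented to illustrate the skip-until-FORK behavior, not the actual `LogicalPlan`/`EsqlSession` classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical model: plan nodes form a tree whose root is the last command.
// When a Fork node exists, every node above it is skipped, and field names
// are resolved per branch and unioned - mirroring the discussion above.
class ForkTraversalSketch {

    static class Plan {
        final String kind;                  // e.g. "Keep", "Fork", "Filter"
        final Set<String> referencedFields; // index fields this node reads
        final List<Plan> children = new ArrayList<>();

        Plan(String kind, Set<String> referencedFields, Plan... children) {
            this.kind = kind;
            this.referencedFields = referencedFields;
            for (Plan c : children) {
                this.children.add(c);
            }
        }

        boolean isFork() {
            return kind.equals("Fork");
        }

        // top-down traversal, like LogicalPlan#forEachDown
        void forEachDown(Consumer<Plan> action) {
            action.accept(this);
            for (Plan c : children) {
                c.forEachDown(action);
            }
        }
    }

    static Set<String> fieldNames(Plan parsed) {
        boolean hasFork = containsFork(parsed);
        AtomicBoolean seenFork = new AtomicBoolean(false);
        Set<String> fields = new LinkedHashSet<>();

        parsed.forEachDown(p -> {
            // skip the plans above FORK; they are not analyzed here
            if (hasFork && seenFork.get() == false && p.isFork() == false) {
                return;
            }
            if (p.isFork()) {
                seenFork.set(true);
                // analyze each branch separately and union the field names
                for (Plan branch : p.children) {
                    fields.addAll(fieldNames(branch));
                }
                return;
            }
            fields.addAll(p.referencedFields);
        });
        return fields;
    }

    static boolean containsFork(Plan plan) {
        boolean[] found = new boolean[] { false };
        plan.forEachDown(p -> found[0] |= p.isFork());
        return found[0];
    }
}
```

For the example query, a tree of KEEP over SORT over FORK would skip the KEEP/SORT references and collect only what the two WHERE branches need.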
... and I figured out the source of confusion here - we do traverse this with forEachDown, not forEachUp 🤦‍♀️ 🤦‍♀️ 🤦‍♀️

I am just reverting to requesting all fields when FORK is used and not trying to optimize this further, since it seems too brittle at the moment. Once we are past tech preview and have amazing test coverage for FORK, we can go back and modify this more confidently.
if (hasFork && seenFork.get() == false && p instanceof Fork == false) {
    return;
}
This condition seems to bypass all other plan types until a Fork is found. This seems brittle. I have the feeling it is like this because of the current limitations of FORK, but I am not sure that's true. Nevertheless, if FORK evolves in the future and has no limitations anymore, does this condition still stand? The condition makes everything else in the query dependent on the existence of FORK. Maybe some comments in the code would explain why this logic makes sense here.
Yes, it is very brittle. I will just revert back to requesting all fields - we can improve this later, and I'd argue it's not a must for tech preview.
LGTM - I think we can simplify some code
@@ -637,6 +616,8 @@ static PreAnalysisResult fieldNames(LogicalPlan parsed, Set<String> enrichPolicy

    boolean[] canRemoveAliases = new boolean[] { true };

    PreAnalysisResult initialResult = result;
    projectAll.set(false);
Nit - I think this assignment is unnecessary
@@ -711,6 +692,10 @@ static PreAnalysisResult fieldNames(LogicalPlan parsed, Set<String> enrichPolicy
    }
});

if (projectAll.get()) {
This can't happen, right? I don't see projectAll being updated in the previous forEachDown() block 🤔
LGTM
fixes #128271
fixes #128272
related #121950
In #121950 we made some changes to how we do field resolution for FORK. Previously we would ask for all fields when FORK was used in the query, which was suboptimal.

After I merged #121950, we got some failing tests in IndexResolverFieldNamesTests. That's because, prior to merging #121950, some other fixes had been merged in EsqlSession.fieldNames, and I did not update my branch with the latest changes.

I am just reverting back to requesting all fields for FORK at the moment.