-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28165 HiveSplitGenerator: send splits through filesystem instead of RPC in case of big payload #5174
Conversation
6fd3057
to
f50490d
Compare
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
queryId = jobConf.get(HiveConf.ConfVars.HIVEQUERYID.varname); | ||
inputName = getContext().getInputName(); | ||
vertexId = getContext().getVertexId(); | ||
appStagingPath = TezCommonUtils.getTezSystemStagingPath(conf, getContext().getApplicationId().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that getContext()
cannot give back null?
The reason why I ask is I already found multiple code path in HiveSplitGenerator that assumes it gives back a null value. Example:
if (inputInitializerContext != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getContext() is not null, if it is, it's a bug and we're fine to fail
regarding other null check: I have no idea, it must be non-null here, provided by tez
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, this is not the only path to be refactored in HiveSplitGenerator: we have a conf and jobConf field as well which is confusing, I'll create a follow-up ticket to cleanup things here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please share the follow-up tickets with me? I would be happy to help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ops I just realized InputInitializerContext is null when the split generation happens outside the AM, e.g.:
2024-04-05T00:57:35,207 ERROR [HiveServer2-Background-Pool: Thread-1193] tez.HiveSplitGenerator: An exception was caught while running split generation, this is not recoverable
java.lang.NullPointerException: null
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator$SplitSerializer.<init>(HiveSplitGenerator.java:193) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.getSplitSerializer(HiveSplitGenerator.java:521) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.createEventList(HiveSplitGenerator.java:475) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:425) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.getSplits(GenericUDTFGetSplits.java:475) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.getSplitResult(GenericUDTFGetSplits.java:254) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits2.process(GenericUDTFGetSplits2.java:78) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
I'll handle it in the code
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
#### A masked pattern was here #### | ||
ABC 1 | ||
DEF 2 | ||
PREHOOK: query: select * from test_table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no difference between the two select * output. Is the explain analyze provides better output to show what happens here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not supposed to have any changes, it should just work with the new serialized split the same way, so we're expecting the same results
this detail is not exposed via explain also
As you mentioned: "NOTE: this patch won't compile until TEZ-4548 is not released in Tez 0.10.4", I guess a pom.xml change would be necessary as well, but I don't see in the list of modified files. |
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had only minor questions, on overall, LGTM.
Tez 0.10.4 has not yet been released, not even planned yet, so this PR is blocked until that |
thanks a lot, I addressed what was possible in c260286 |
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1, pending tests
* It utilizes an ExecutorService for parallel writes to prevent a single split write operation | ||
* becoming the bottleneck (as write() is called from a loop currently). | ||
*/ | ||
class SplitSerializer implements AutoCloseable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could not be either private or static or final
- testing it from TestHiveSplitGenerator (private)
- jobConf cannot be accessed from static context (static)
- it's inherited in unit test (final)
@abstractdog, are we still waiting for the Tez changes to land? |
yes, unfortunately, we cannot merge this until tez 0.10.4 is released, that's why I set it blocked by: |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
NOTE: this patch won't compile until TEZ-4548 is not released in Tez 0.10.4, because it uses new API InputDataInformationEvent.createWithSerializedPath(...) and InputInitializerContext.getVertexId().
I ran a full recommit test against a downstream tez containing this change, that's how I got qtest and unit test results.
What changes were proposed in this pull request?
Serialize splits to filesystem according to their size.
Why are the changes needed?
To avoid holding too much split payload in memory, as they're are collected in a loop and returned as a list.
Does this PR introduce any user-facing change?
No.
Is the change a dependency upgrade?
No.
How was this patch tested?
Cluster testing:
Unit testing: