Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28165 HiveSplitGenerator: send splits through filesystem instead of RPC in case of big payload #5174

Closed
wants to merge 9 commits into from

Conversation

abstractdog
Copy link
Contributor

@abstractdog abstractdog commented Mar 31, 2024

NOTE: this patch won't compile until TEZ-4548 is not released in Tez 0.10.4, because it uses new API InputDataInformationEvent.createWithSerializedPath(...) and InputInitializerContext.getVertexId().
I ran a full recommit test against a downstream tez containing this change, that's how I got qtest and unit test results.

What changes were proposed in this pull request?

Serialize splits to filesystem according to their size.

Why are the changes needed?

To avoid holding too much split payload in memory, as they're are collected in a loop and returned as a list.

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

Cluster testing:

  • logs sanity-checked
  • OOM issue was resolved
    Unit testing:
  • hive unit tests
  • tez unit tests

queryId = jobConf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
inputName = getContext().getInputName();
vertexId = getContext().getVertexId();
appStagingPath = TezCommonUtils.getTezSystemStagingPath(conf, getContext().getApplicationId().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that getContext() cannot give back null?

The reason why I ask is I already found multiple code path in HiveSplitGenerator that assumes it gives back a null value. Example:

Copy link
Contributor Author

@abstractdog abstractdog Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getContext() is not null, if it is, it's a bug and we're fine to fail
regarding other null check: I have no idea, it must be non-null here, provided by tez

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, this is not the only path to be refactored in HiveSplitGenerator: we have a conf and jobConf field as well which is confusing, I'll create a follow-up ticket to cleanup things here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share the follow-up tickets with me? I would be happy to help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ops I just realized InputInitializerContext is null when the split generation happens outside the AM, e.g.:

2024-04-05T00:57:35,207 ERROR [HiveServer2-Background-Pool: Thread-1193] tez.HiveSplitGenerator: An exception was caught while running split generation, this is not recoverable
java.lang.NullPointerException: null
	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator$SplitSerializer.<init>(HiveSplitGenerator.java:193) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.getSplitSerializer(HiveSplitGenerator.java:521) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.createEventList(HiveSplitGenerator.java:475) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:425) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.getSplits(GenericUDTFGetSplits.java:475) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.getSplitResult(GenericUDTFGetSplits.java:254) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits2.process(GenericUDTFGetSplits2.java:78) ~[hive-exec-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]

I'll handle it in the code

#### A masked pattern was here ####
ABC 1
DEF 2
PREHOOK: query: select * from test_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no difference between the two select * output. Is the explain analyze provides better output to show what happens here?

Copy link
Contributor Author

@abstractdog abstractdog Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not supposed to have any changes, it should just work with the new serialized split the same way, so we're expecting the same results
this detail is not exposed via explain also

@zratkai
Copy link
Contributor

zratkai commented Apr 2, 2024

As you mentioned: "NOTE: this patch won't compile until TEZ-4548 is not released in Tez 0.10.4", I guess a pom.xml change would be necessary as well, but I don't see in the list of modified files.

Copy link
Contributor

@InvisibleProgrammer InvisibleProgrammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had only minor questions, on overall, LGTM.

@abstractdog
Copy link
Contributor Author

abstractdog commented Apr 2, 2024

As you mentioned: "NOTE: this patch won't compile until TEZ-4548 is not released in Tez 0.10.4", I guess a pom.xml change would be necessary as well, but I don't see in the list of modified files.

Tez 0.10.4 has not yet been released, not even planned yet, so this PR is blocked until that
so I could only expect a formal code review here, without providing the test results (what I did downstream btw)

@abstractdog
Copy link
Contributor Author

abstractdog commented Apr 2, 2024

I had only minor questions, on overall, LGTM.

thanks a lot, I addressed what was possible in c260286
and replied about the rest, please let me know if you can see anything else

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1, pending tests

* It utilizes an ExecutorService for parallel writes to prevent a single split write operation
* becoming the bottleneck (as write() is called from a loop currently).
*/
class SplitSerializer implements AutoCloseable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private static final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could not be either private or static or final

  1. testing it from TestHiveSplitGenerator (private)
  2. jobConf cannot be accessed from static context (static)
  3. it's inherited in unit test (final)

@deniskuzZ
Copy link
Member

@abstractdog, are we still waiting for the Tez changes to land?

@abstractdog
Copy link
Contributor Author

@abstractdog, are we still waiting for the Tez changes to land?

yes, unfortunately, we cannot merge this until tez 0.10.4 is released, that's why I set it blocked by:
https://issues.apache.org/jira/browse/TEZ-4556

Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Jul 27, 2024
@github-actions github-actions bot closed this Aug 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants