
TEZ-4547: Add Tez AM JobID to the JobConf #339

Merged: 8 commits, Aug 5, 2024

Conversation

VenkatSNarayanan
Contributor

Some committers require a job-wide UUID to function correctly. Adding the AM JobID to the JobConf will allow applications to pass that to the committers that need it.

@@ -417,6 +418,7 @@ protected List<Event> initializeBase() throws IOException, InterruptedException
.createMockTaskAttemptID(getContext().getApplicationId().getClusterTimestamp(),
getContext().getTaskVertexIndex(), getContext().getApplicationId().getId(),
getContext().getTaskIndex(), getContext().getTaskAttemptNumber(), isMapperOutput);
jobConf.set(MRJobConfig.MR_PARENT_JOB_ID, new JobID(String.valueOf(getContext().getApplicationId().getClusterTimestamp()), getContext().getApplicationId().getId()).toString());
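The hunk above builds the parent job ID from the YARN ApplicationId's cluster timestamp and sequence number. A minimal sketch of the resulting string (plain Java, no Hadoop dependency; the format mirrors how Hadoop's JobID renders, i.e. job_&lt;jtIdentifier&gt;_&lt;NNNN&gt; with the numeric part zero-padded to four digits):

```java
// Sketch (no Hadoop dependency) of the job-ID string the hunk above produces.
// The jtIdentifier is the ApplicationId's cluster timestamp and NNNN is the
// application's sequence number, zero-padded to four digits.
public class ParentJobIdSketch {
    static String parentJobId(long clusterTimestamp, int appId) {
        return String.format("job_%d_%04d", clusterTimestamp, appId);
    }

    public static void main(String[] args) {
        // e.g. an application started at cluster timestamp 1437886552540, id 1
        System.out.println(parentJobId(1437886552540L, 1));
    }
}
```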
Contributor

Move this to the line above TaskAttemptID taskAttemptId =.

Comment on lines 151 to 152

assertNotEquals(parentJobID,invalidJobID);
assertNotEquals(output.jobConf.get(org.apache.hadoop.mapred.JobContext.TASK_ATTEMPT_ID),parentJobID);

Contributor

Fix code formatting: add a space after each comma.

@shameersss1
Contributor

If my understanding is correct, Hive/Pig would use the value from mapreduce.parent.job.id to set the correct committer UUID, right?

@@ -119,6 +119,7 @@ public void abortOutput(VertexStatus.State finalState) throws IOException {
|| jobConf.getBoolean("mapred.mapper.new-api", false)) {
newApiCommitter = true;
}
jobConf.set(MRJobConfig.MR_PARENT_JOB_ID, new org.apache.hadoop.mapred.JobID(String.valueOf(getContext().getApplicationId().getClusterTimestamp()), getContext().getApplicationId().getId()).toString());
Contributor

Can we move the new JobID(String.valueOf(getContext().getApplicationId().getClusterTimestamp()), getContext().getApplicationId().getId()).toString() expression into a method and reuse it in MROutput.java as well?

@VenkatSNarayanan
Contributor, Author

If my understanding is correct, Hive/Pig would use the value from mapreduce.parent.job.id to set the correct committer UUID, right?

Yes, that was the plan. The property name was chosen arbitrarily just so I could put the PR up; any suggestions for a better one are welcome.

This commit also adds the DAG identifier to the job UUID
to ensure that multiple jobs within the same session will
be assigned different UUIDs.

@VenkatSNarayanan
Contributor, Author

@shameersss1 We could actually just set fs.s3a.committer.uuid directly instead of the indirection through the other setting.

Switch UUID property name to the
one required by S3A committers.

@shameersss1
Contributor

LGTM +1

@shameersss1
Contributor

@abstractdog - Could you please review the same?

Refactors the implementation to reuse Tez's
DAGID type instead of hand-rolling our own.

@VenkatSNarayanan
Contributor, Author

@abstractdog @shameersss1 Is there anything else needed?


@abstractdog
Contributor

Left minor comments on this, @VenkatSNarayanan; other than that, this looks good to me.

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 25m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 6m 14s Maven dependency ordering for branch
+1 💚 mvninstall 12m 57s master passed
+1 💚 compile 1m 56s master passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1
+1 💚 compile 1m 44s master passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
+1 💚 checkstyle 1m 58s master passed
+1 💚 javadoc 1m 44s master passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1
+1 💚 javadoc 1m 29s master passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
+0 🆗 spotbugs 1m 20s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 3m 47s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 1m 9s the patch passed
+1 💚 compile 1m 17s the patch passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1
+1 💚 javac 1m 17s the patch passed
+1 💚 compile 1m 7s the patch passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
+1 💚 javac 1m 7s the patch passed
-0 ⚠️ checkstyle 0m 13s tez-api: The patch generated 1 new + 16 unchanged - 0 fixed = 17 total (was 16)
-0 ⚠️ checkstyle 0m 19s tez-mapreduce: The patch generated 3 new + 368 unchanged - 0 fixed = 371 total (was 368)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 52s the patch passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1
+1 💚 javadoc 0m 52s the patch passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
+1 💚 findbugs 3m 5s the patch passed
_ Other Tests _
+1 💚 unit 2m 17s tez-api in the patch passed.
+1 💚 unit 1m 23s tez-mapreduce in the patch passed.
+1 💚 unit 5m 0s tez-dag in the patch passed.
+1 💚 asflicense 0m 34s The patch does not generate ASF License warnings.
78m 8s
Subsystem Report/Notes
Docker ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/8/artifact/out/Dockerfile
GITHUB PR #339
JIRA Issue TEZ-4547
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux e2f1dda150af 5.15.0-106-generic #116-Ubuntu SMP Wed Apr 17 09:17:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 19b2351
Default Java Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/8/artifact/out/diff-checkstyle-tez-api.txt
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/8/artifact/out/diff-checkstyle-tez-mapreduce.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/8/testReport/
Max. process+thread count 423 (vs. ulimit of 5500)
modules C: tez-api tez-mapreduce tez-dag U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/8/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog abstractdog self-requested a review June 28, 2024 04:05
@abstractdog
Contributor

one more thing @VenkatSNarayanan , please address checkstyle comments where applicable, thanks!

@tez-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ docker 0m 20s Docker failed to build yetus/tez:86b11997b.
Subsystem Report/Notes
GITHUB PR #339
JIRA Issue TEZ-4547
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-339/9/console
versions git=2.34.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

/**
* Used by committers to set a job-wide UUID.
*/
public static final String JOB_COMMITTER_UUID = "job.committer.uuid";
Contributor

This is not the setting used by the S3 committer, right? How will it work?

Contributor, Author

There is a corresponding change I have in my Hadoop code where it will consult this property, similar to how it consults the property Spark sets for this purpose.
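For context, a hedged sketch of the lookup order this comment describes. The Hadoop-side change is not public in this thread, so the fallback to job.committer.uuid is an assumption based on the comment; the Spark key is the one AbstractS3ACommitter already consults:

```java
import java.util.Map;

// Sketch of a committer-side UUID lookup: prefer the Spark-set property,
// then fall back to the job.committer.uuid key this PR sets. The fallback
// is an assumption based on the comment above, not merged Hadoop code.
public class UuidLookupSketch {
    static final String SPARK_WRITE_UUID = "spark.sql.sources.writeJobUUID";
    static final String JOB_COMMITTER_UUID = "job.committer.uuid";

    static String resolveJobUuid(Map<String, String> conf) {
        String uuid = conf.get(SPARK_WRITE_UUID);
        return (uuid != null && !uuid.isEmpty()) ? uuid : conf.get(JOB_COMMITTER_UUID);
    }

    public static void main(String[] args) {
        System.out.println(resolveJobUuid(
            Map.of(JOB_COMMITTER_UUID, "dag_1437886552540_0001_1")));
    }
}
```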

Contributor

So you can confirm this will work with job.committer.uuid, right? Can you link that point in the Hadoop code for later reference?

Contributor

@VenkatSNarayanan May I ask whether the Hadoop S3 committer can work with Hive+Tez after this change? IMO, the s3/magic committer can avoid some operations like rename on S3, which can speed up and improve Hive jobs.

Contributor, Author

@abstractdog I haven't publicly posted the Hadoop PR yet, but the change I have is to check for this property around here: https://github.com/apache/hadoop/blob/51cb858cc8c23d873d4adfc21de5f2c1c22d346f/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L1372, similar to how the Spark property is checked. I have already tested these changes together alongside my Hive implementation.

@zhangbutao There are some corresponding changes to Hadoop and Hive, which I have, that also need to be merged. Once all three PRs (Tez, Hadoop, and Hive) have been merged, the magic committer will be usable with Hive.

@abstractdog
Contributor
Jul 16, 2024

Should this go into Tez 0.10.4? If so, it would be good to have it in 1-2 weeks. Just FYI, regarding planning the Hadoop change.

Contributor, Author

0.10.4 would be ideal. In that case, let me loop in the Hadoop folks to see if they have any strong opinions about this.

Contributor

@VenkatSNarayanan https://issues.apache.org/jira/browse/HIVE-16295 - I found an old ticket about integrating the s3a committer, and it seems that supporting this needs a lot of Hive code change. I am not sure whether you have done a similar change in Hive to support the MagicS3GuardCommitter. Anyway, I think it is very good to support this committer in Hive & Tez. Looking forward to your further work. Thanks.

Contributor

https://issues.apache.org/jira/browse/HADOOP-19091 - I just saw your Hadoop ticket, and the Hive change patch is there too. Maybe you should create a PR against the latest Hive master branch once you have done the preparatory work. :)

Contributor, Author

There haven't been any objections from the Hadoop folks; I think it should be safe to go ahead with the patch as it is, @abstractdog.

@steveloughran left a comment

Commented. All s3a committers save a JSON _SUCCESS file (parser in hadoop-aws for older Hadoop releases, in hadoop-mapreduce more recently). You can verify the job ID end to end with this.

@@ -78,6 +79,7 @@ public void initialize() throws IOException {
jobConf.getCredentials().mergeAll(UserGroupInformation.getCurrentUser().getCredentials());
jobConf.setInt(MRJobConfig.APPLICATION_ATTEMPT_ID,
getContext().getDAGAttemptNumber());
jobConf.set(MRJobConfig.JOB_COMMITTER_UUID, Utils.getDAGID(getContext()));

Is this unique across all jobs which may be writing to a table, even from other processes?

Contributor, Author

Yes. This ID is unique to a DAG + attempt number - so if we have some other job, it'll have a different application ID component, while if an attempt fails and the DAG retries, the attempt number will be different.
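A minimal sketch (plain Java, no Tez dependency) of why the DAG-derived value distinguishes jobs: Tez DAG IDs render roughly as dag_&lt;clusterTimestamp&gt;_&lt;NNNN&gt;_&lt;dagIndex&gt;, so separate applications differ in the ApplicationId part and successive DAGs in one session differ in the trailing index. The exact rendering here is illustrative:

```java
public class DagIdSketch {
    // Illustrative rendering of a Tez DAG ID: application cluster timestamp,
    // zero-padded application sequence number, then the per-session DAG index.
    static String dagId(long clusterTimestamp, int appId, int dagIndex) {
        return String.format("dag_%d_%04d_%d", clusterTimestamp, appId, dagIndex);
    }

    public static void main(String[] args) {
        // Two DAGs in the same session get distinct IDs via the DAG index.
        System.out.println(dagId(1437886552540L, 1, 1));
        System.out.println(dagId(1437886552540L, 1, 2));
    }
}
```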

@steveloughran

OK. If you look into the _SUCCESS JSON from an s3a or the manifest committer, the job ID is one of the root attributes, as is its source.

There's a Java definition of this in org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestSuccessData in recent hadoop-mapreduce binaries.
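The end-to-end check Steve suggests could be sketched like this: read the JSON _SUCCESS marker and pull out its top-level jobId attribute to compare against the configured UUID. A real test would deserialize ManifestSuccessData; the regex here only keeps the sketch dependency-free, and the field name jobId is taken from that class:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuccessFileCheck {
    // Pull the top-level "jobId" string out of a _SUCCESS JSON body.
    static String extractJobId(String successJson) {
        Matcher m = Pattern.compile("\"jobId\"\\s*:\\s*\"([^\"]+)\"")
                           .matcher(successJson);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String json = "{\"committer\":\"magic\","
                    + "\"jobId\":\"dag_1437886552540_0001_1\"}";
        System.out.println(extractJobId(json));
    }
}
```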

@abstractdog
Contributor

Guys, @steveloughran, @VenkatSNarayanan: please let me know whether this PR is fine to be merged to Tez (from Hadoop's point of view) - I'm about to start the release process of 0.10.4 soon. The latest comment is that there are no objections, so I'm assuming we're fine with the current name of this config property.

@abstractdog
Contributor

FYI: I'm about to merge this tomorrow to have this in Tez 0.10.4.

@abstractdog abstractdog merged commit 563b494 into apache:master Aug 5, 2024
4 checks passed
7 participants