HADOOP-19696. hadoop binary distribution to move cloud connectors to hadoop common/lib (#7980) #8094
base: branch-3.4
…hadoop common/lib (apache#7980)

This moves all the cloud connector libraries to common/lib. There are specific build options to control which libraries to include. The hadoop-* JARs of the modules are included, but dependencies are only included when the build-time options specify it.

Available package profiles:
hadoop-aliyun-package
hadoop-aws-package
hadoop-azure-datalake-package
hadoop-cos-package
hadoop-huaweicloud-package

This means that by default the AWS bundle.jar is no longer included in the distribution: to add it, users must drop their chosen version of the SDK into share/hadoop/common/lib.

Anyone building their own release now has a choice of which connectors to bundle. The ASF ones will stay fairly lean to reduce the CVE attack surface as well as keep package size under control.

Contributed by Steve Loughran
mvn package -Pdist -DskipTests -Dhadoop-aws-package -Dhadoop-azure-datalake-package

Available package profiles:
hadoop-aws-package
restore hadoop-aliyun-package
-------------

aopalliance:aopalliance:1.0
cut
build with -Dhadoop-aws-package -Dhadoop-azure-datalake-package
Available package profiles:
hadoop-aws-package
restore hadoop-aliyun-package docs
💔 -1 overall
This message was automatically generated.
Hmm... I feel this is a surprising change for branch-3.4.
@pan3793 I understand your concerns, but this is actually a packaging improvement. The current build
Moving the hadoop-* cloud libs into common/lib but leaving out dependencies means the stuff is in the right place, provided users manually add the dependencies (which I'm not going to build with, except for hadoop-azure, as those are the httpclient and wildfly libs we use elsewhere). This makes releasing easier, and it makes adding the dependencies easier: the current setup requires a user to add the specific bundle-jar version the release was built with, and the same for the other components. That's why this change has been a blocker for a 3.4.3 release: the release process itself is what needs fixing.
@steveloughran I understand it's a little bit tricky to create the lean tarball (should be similar to the aarch tarball?). Given that there is already a working script for it, I don't think it's a blocker for the 3.4 patch releases. TBH, I think Hadoop is currently abusing patch releases somewhat: the recent patch releases contain more features than bug fixes, and even breaking changes.
I've been trying to keep 3.4.3 low diff-wise to 3.4.2, balancing out the need for a lot of those transient CVE fixes. Other than anything related to avro updates, nothing API-wise should be causing regressions. I'd like 3.4.3 to be the last Java 8 release, though I suspect we may need some dependency update releases next year. It's got a stabilisation of the AWS analytics reader, but the only breaking change there is that we change the default to "on"; people can switch back. Maybe we should discuss this on common-dev?
@steveloughran, I may not have enough background to comment on all the changes. Specific to this PR, I wonder
Compile/release is one-time stuff, while install/setup is more frequent. I would rather not have such a change in a patch release if it is not a blocker and only makes the release process easier. Anyway, I don't have a binding veto, and I trust your authority in the Hadoop project.
Replace all uses of the word "Li2cense" with "License" in hadoop-assemblies
correct
It doesn't set any of the new profiles, meaning the hadoop-* jars will be there and abfs will work (limited dependencies), but anything else needs extra jars added. FWIW third-party cloud deployments (EMR, Azure etc.) all get their connectors onto the classpath, somehow. And because hadoop-aws has a bit of bash script in its src/main, somehow it gets into "bin/hadoop fs" commands; the shell script wiring-up logic does this.
I specifically mean the output of the command
https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/deployment/resource-providers/yarn/ but according to the answer to the next question, it seems fine.
@steveloughran, thanks for your detailed explanation. I don't have any more questions.
@pan3793 happy to explain, and happy to hear your concerns.

The build is set to keep out all the new transitive dependencies as much for security reasons as size: fewer dependencies, fewer dependabot alerts about CVEs, fewer jars to keep updating.

But hadoop-gcs and hadoop-cos (or is it hadoop-tos? it's 3.5+ only I think) both build shaded releases with all their dependencies in their JARs if you do a -Pdist build, which of course ASF releases do. This makes for bigger artifacts, but not so big that they create a distribution problem, and that shading makes direct use of the fs through an import of the jar or the hadoop-cloud-storage pom easier. Changing that packaging to not do the shading would add many, many more libraries to hadoop common (all enumerated and listed in LICENSE-binary), and would complicate use through maven declarations: classpath hell.

Although I don't use the Aliyun, tos, cos or volcano connectors myself, I don't want to do anything to stop them being used, and these changes make it easy for people to build a hadoop release with out-of-the-box support for them. FWIW, in cloudera we
This moves all the cloud connector libraries to common/lib. There are specific build options to control which libraries to include. The hadoop-* JARs of the modules are included, but dependencies are only included when the build-time options specify it.
Available package profiles:
hadoop-aliyun-package
hadoop-aws-package
hadoop-azure-datalake-package
hadoop-cos-package
hadoop-huaweicloud-package
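As a rough illustration of how the profiles listed above combine on the command line, here is a minimal sketch. It assumes each profile is switched on by a `-D<profile-name>` property, as in the `mvn package -Pdist -DskipTests -Dhadoop-aws-package -Dhadoop-azure-datalake-package` line in this PR's diff; `build_cmd` is a hypothetical helper, not part of the Hadoop build.

```shell
# Sketch: compose the dist build command for a chosen set of connector profiles.
# build_cmd is a hypothetical helper; it only prints the command, it does not run mvn.
build_cmd() {
  cmd="mvn package -Pdist -DskipTests"
  for profile in "$@"; do
    # Each selected profile is passed as a -D property.
    cmd="$cmd -D$profile"
  done
  printf '%s\n' "$cmd"
}

# Print the command for a build bundling the AWS and Azure Data Lake dependencies.
build_cmd hadoop-aws-package hadoop-azure-datalake-package
```

Calling `build_cmd` with no arguments prints the plain dist build, which sets none of the package profiles.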
This means that by default the AWS bundle.jar is no longer included in the distribution: to add it, users must drop their chosen version of the SDK into share/hadoop/common/lib.
Anyone building their own release now has a choice of which connectors to bundle. The ASF ones will stay fairly lean to reduce the CVE attack surface as well as keep package size under control.
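For anyone adding the SDK to an unpacked distribution by hand, the step could look roughly like the sketch below. It assumes the distribution is unpacked at a directory the caller supplies and that the jar has already been downloaded; `install_sdk_jar` is a hypothetical helper, and only the `share/hadoop/common/lib` location comes from this PR.

```shell
# Sketch: copy a user-supplied SDK jar into the distribution's common/lib.
# install_sdk_jar is a hypothetical helper; both arguments are caller-supplied paths.
install_sdk_jar() {
  jar="$1"
  hadoop_home="$2"
  lib_dir="$hadoop_home/share/hadoop/common/lib"
  # Refuse to continue if the jar is missing.
  if [ ! -f "$jar" ]; then
    echo "no such jar: $jar" >&2
    return 1
  fi
  # Refuse to continue if the target does not look like an unpacked distribution.
  if [ ! -d "$lib_dir" ]; then
    echo "not a Hadoop distribution: $hadoop_home" >&2
    return 1
  fi
  cp "$jar" "$lib_dir/"
}
```

A call would then be something like `install_sdk_jar /path/to/bundle.jar "$HADOOP_HOME"`, with the jar path replaced by wherever the chosen SDK version was saved.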
This is the branch-3.4 variant, which cuts out connectors that are not present (tos, gcp).
How was this patch tested?
Manual builds; another in progress.
LICENSE-binary validated by looking at the dependencies of hadoop-cloud-storage, making sure the needed ones were there and deleting some which no longer appeared.
For code changes:
LICENSE, LICENSE-binary, NOTICE-binary files?