
HADOOP-18891 hadoop distcp needs support to filter by file/directory attribute #6058


Closed
wants to merge 1,580 commits into from

Conversation


@AuthurWang2009 AuthurWang2009 commented Sep 13, 2023

Description of PR

In some circumstances, we need to filter files and directories by their attributes, for example by modification time or by whether the entry is a directory.

So, should we introduce a new method public boolean shouldCopy(CopyListingFileStatus fileStatus)?
This approach would give a more flexible way to filter than public abstract boolean shouldCopy(Path path) alone.

To achieve the goal:
1. Add a shouldCopy(CopyListingFileStatus fileStatus) method to the CopyFilter abstract class, together with a supportFileStatus() switch method that returns false by default.
2. Subclasses that implement the abstract class and want to use the new method should override shouldCopy(CopyListingFileStatus fileStatus) and, at the same time, override supportFileStatus() to return true.
3. This change stays compatible with existing use cases.
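The design above can be sketched as follows. This is a minimal, self-contained illustration, not the real Hadoop API: SimpleFileStatus, CopyFilterSketch, and DirOnlyFilterSketch are hypothetical stand-ins for CopyListingFileStatus, CopyFilter, and a filtering subclass.

```java
// SimpleFileStatus is a hypothetical stand-in for Hadoop's
// CopyListingFileStatus; the real class carries far more state.
class SimpleFileStatus {
  final String path;
  final boolean isDirectory;
  final long modificationTime;

  SimpleFileStatus(String path, boolean isDirectory, long modificationTime) {
    this.path = path;
    this.isDirectory = isDirectory;
    this.modificationTime = modificationTime;
  }
}

// Sketch of the proposed CopyFilter shape (not the real Hadoop class).
abstract class CopyFilterSketch {
  // Existing path-based contract.
  public abstract boolean shouldCopy(String path);

  // New, richer hook; the default delegates to the path form,
  // so existing subclasses keep their old behaviour.
  public boolean shouldCopy(SimpleFileStatus status) {
    return shouldCopy(status.path);
  }

  // Switch method: subclasses that override the new hook return true
  // so the copy listing knows which overload to call.
  public boolean supportFileStatus() {
    return false;
  }
}

// Example subclass opting in to the new hook: copy directories only.
class DirOnlyFilterSketch extends CopyFilterSketch {
  @Override
  public boolean shouldCopy(String path) {
    return true; // a bare path cannot tell a directory from a file
  }

  @Override
  public boolean shouldCopy(SimpleFileStatus status) {
    return status.isDirectory;
  }

  @Override
  public boolean supportFileStatus() {
    return true;
  }
}

public class CopyFilterDemo {
  public static void main(String[] args) {
    CopyFilterSketch filter = new DirOnlyFilterSketch();
    SimpleFileStatus dir = new SimpleFileStatus("/data/in", true, 0L);
    SimpleFileStatus file = new SimpleFileStatus("/data/in/part-0", false, 0L);
    System.out.println(filter.supportFileStatus()
        && filter.shouldCopy(dir)
        && !filter.shouldCopy(file)); // prints true
  }
}
```

Because supportFileStatus() defaults to false and the new overload delegates to the path form, existing CopyFilter subclasses compile and behave exactly as before, which is what makes point 3 hold.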

As an implementation:
1. First, create an abstract FileStatusCopyFilter class extending CopyFilter.
2. Then create a DirCopyFilter class extending FileStatusCopyFilter.
3. Finally, implement UniformRecordInputFormat to support DirCopyFilter.

How was this patch tested?

Added unit tests.

Manual testing:
1. Add distcp.filters.class=org.apache.hadoop.tools.DirCopyFilter to distcp-default.xml, or set it with -Ddistcp.filters.class=org.apache.hadoop.tools.DirCopyFilter.
2. Then execute distcp commands.
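As a configuration fragment, the property from step 1 would go into distcp-default.xml roughly like this (sketch; the property name and filter class are taken from the steps above):

```xml
<property>
  <name>distcp.filters.class</name>
  <value>org.apache.hadoop.tools.DirCopyFilter</value>
</property>
```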

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

simbadzina and others added 30 commits August 23, 2022 11:27
…for multiple nameservices.

Fixes apache#4584

Signed-off-by: Owen O'Malley <[email protected]>
* HADOOP-18001. Upgrade jetty version to 9.4.44 (apache#3700). Contributed by Yuan Luo.

Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit b85c66a)

* HADOOP-18333.Upgrade jetty version to 9.4.48.v20220622 (apache#4553)

Co-authored-by: Ashutosh Gupta <[email protected]>
(cherry picked from commit e664f81)

 Conflicts:
	LICENSE-binary

Change-Id: I5a758df2551539c2780e170c3738c5b21eb0c79d

Co-authored-by: better3471 <[email protected]>
Co-authored-by: Ashutosh Gupta <[email protected]>
This addresses an issue where the plugin's default classpath
for executing tests fails to include
org.junit.platform.launcher.core.LauncherFactory.

Contributed by: Steve Vaughan Jr
…che#4728)

Declares its compatibility with Spark's dynamic
output partitioning by having the stream capability
"mapreduce.job.committer.dynamic.partitioning"

Requires a Spark release with SPARK-40034, which
does the probing before deciding whether to
accept/reject instantiation with
dynamic partition overwrite set

This feature can be declared as supported by
any other PathOutputCommitter implementations
whose algorithm and destination filesystem
are compatible.

None of the S3A committers are compatible.

The classic FileOutputCommitter is, but it
does not declare itself as such out of our fear
of changing that code. The Spark-side code
will automatically infer compatibility if
the created committer is of that class or
a subclass.

Contributed by Steve Loughran.
…oxy with multiple connections per user.

Fixes apache#4748

Signed-off-by: Owen O'Malley <[email protected]>
…k report is delayed (apache#4756)

Signed-off-by: Erik Krogen <[email protected]>

(cherry picked from commit 231a446)
The JournalNodeSyncer will include the local instance in syncing when using a bind host (e.g. 0.0.0.0).  There is a mechanism that is supposed to exclude the local instance, but it doesn't recognize the meta-address as a local address.

Running with bind addresses set to 0.0.0.0, the JournalNodeSyncer will log attempts to sync with itself as part of the normal syncing rotation.  For an HA configuration running 3 JournalNodes, the "other" list used by the JournalNodeSyncer will include 3 proxies.

Exclude bound local addresses, including the use of a wildcard address in the bound host configurations, while still allowing multiple instances on the same host.

Allow sync attempts with unresolved addresses, so that sync attempts can drive resolution as servers become available.

Backport.
Signed-off-by: stack <[email protected]>
…ions (apache#4766)


HADOOP-16202 "Enhance openFile()" added asynchronous draining of the
remaining bytes of an S3 HTTP input stream for those operations
(unbuffer, seek) where it could avoid blocking the active
thread.

This patch fixes the asynchronous stream draining to work and so
return the stream back to the http pool. Without this, whenever
unbuffer() or seek() was called on a stream and an asynchronous
drain triggered, the connection was not returned; eventually
the pool would be empty and subsequent S3 requests would
fail with the message "Timeout waiting for connection from pool"

The root cause was that even though the fields passed in to drain() were
converted to references through the methods, in the lambda expression
passed in to submit, they were direct references

operation = client.submit(
 () -> drain(uri, streamStatistics,
       false, reason, remaining,
       object, wrappedStream));  /* here */

Those fields were only read during the async execution, at which
point they would have been set to null (or even a subsequent read).

A new SDKStreamDrainer class performs the draining; this is a Callable
and can be submitted directly to the executor pool.

The class is used in both the classic and prefetching s3a input streams.

Also, calling unbuffer() switches the S3AInputStream from adaptive
to random IO mode; that is, it is considered a cue that future
IO will not be sequential, whole-file reads.

Contributed by Steve Loughran.
part of HADOOP-18103.

Contributed By: Mukund Thakur
…rect buffers (apache#4787)

part of HADOOP-18103.

Contributed By: Mukund Thakur
… to unexpected host resolution (apache#4833)

Use ".invalid" domain from IETF RFC 2606 to ensure that the host doesn't resolve.

Contributed by Steve Vaughan Jr
…writes to disk. (apache#4669)

Follow-up to HADOOP-12020 Support configuration of different S3 storage classes;
S3 storage class is now set when buffering to heap/bytebuffers, and when
creating directory markers

Contributed by Monthon Klongklaew
…onfiguration (apache#4758)

ITestAbfsManifestCommitProtocol  to set requireRenameResilience to false for nonHNS configuration

Contributed by Sree Bhattacharyya
…cs and ITestAbfsRestOperationException (apache#3699)

Successor for the reverted PR apache#3341, using the hadoop @VisibleForTesting attribute

Contributed by Sumangala Patki
…bled. (apache#4862)


part of HADOOP-18103.

While merging the ranges in CheckSumFs, they are rounded up based on the
value of checksum bytes size which leads to some ranges crossing the EOF
thus they need to be fixed else it will cause EOFException during actual reads.

Contributed By: Mukund Thakur
…regation enabled (apache#4703)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit 65a027b)
…lit when reading BZip2 text files (apache#4732)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit 30c36ef)
…tion fails (apache#4896)


The patch provides detailed diagnostics of file creation failure in LocalDirAllocator.

Contributed by: Ashutosh Gupta
…en DN gets a reconstruction task (apache#4901)

(cherry picked from commit 9a29075)
…4909)

This problem surfaced in impala integration tests
   IMPALA-11592. TestLocalCatalogRetries.test_fetch_metadata_retry fails in S3 build
after the change
  HADOOP-17461. Add thread-level IOStatistics Context
The actual GC race condition came with
 HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

The fix for this is, if our hypothesis is correct, in WeakReferenceMap.create()
where a strong reference to the new value is kept in a local variable
*and referred to later* so that the JVM will not GC it.

Along with the fix, extra assertions ensure that if the problem is not fixed,
applications will fail faster/more meaningfully.

Contributed by Steve Loughran.
…ash.moveToAppropriateTrash (apache#4869)

* HADOOP-18444 Add Support for localized trash for ViewFileSystem in Trash.moveToAppropriateTrash

Signed-off-by: Xing Lin <[email protected]>
Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
…ing XML received from an untrusted source (apache#4886)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit d9f435f)
Follow on to HADOOP-17461.

Contributed by: Mehakmeet Singh
part of HADOOP-18103.

Also introducing a config fs.s3a.vectored.active.ranged.reads
to configure the maximum number of range reads a
single input stream can have active (downloading, or queued)
to the central FileSystem instance's pool of queued operations.
This stops a single stream overloading the shared thread pool.

Contributed by: Mukund Thakur
 Conflicts:
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
haiyang1987 and others added 19 commits May 21, 2023 09:06
… in DFSAdmin (apache#5667). Contributed by Haiyang Hu.

Reviewed-by: Viraj Jasani <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
…estinationForPath (apache#5276) (apache#5423). Contributed by farmmamba

Reviewed-by: Inigo Goiri <[email protected]>
Reviewed-by: Tao Li <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
…system (apache#5704)

This is a followup to
HADOOP-18724. Open file fails with NumberFormatException for S3AFileSystem

Contributed by Steve Loughran
…d blocks from datanode (apache#5643). Contributed by hfutatzhanghb.

Reviewed-by: Stephen O'Donnell <[email protected]>
Reviewed-by: zhangshuyan <[email protected]>
Signed-off-by: He Xiaoqiao <[email protected]>
(cherry picked from commit 0e6bd09)
…iner path on yarn applications (apache#3908)

Co-authored-by: Monthon Klongklaew <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
…ache#5435) (apache#5729)

This fixes CVE-2022-41881.

This also upgrades io.opencensus dependencies to 0.12.3

Contributed by Aleksandr Nikolaev

(cherry picked from commit 734f7ab)

 Conflicts:
	hadoop-project/pom.xml

Change-Id: I26b8961725706370ac5f0fa248d0b0333034a047

Co-authored-by: nao <[email protected]>
… to reduce impact on heartbeat. (apache#5408). Contributed by ZhangHB.

Signed-off-by: He Xiaoqiao <[email protected]>
This modifies the manifest committer so that the list of files
to rename is passed between stages as a file of
writeable entries on the local filesystem.

The map of directories to create is still passed in memory;
this map is built across all tasks, so even if many tasks
created files, if they all write into the same set of directories
the memory needed is O(directories) with the
task count not a factor.

The _SUCCESS file reports on heap size through gauges.
This should give a warning if there are problems.

Contributed by Steve Loughran
…uted by Dongjoon Hyun.

Reviewed-by: Gautham B A <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit fb16e00)

Conflicts:
	hadoop-tools/hadoop-federation-balance/pom.xml
…eksandr Nikolaev.

Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit acf82d4)
(cherry picked from commit b8a76f6)
Signed-off-by: Brahma Reddy Battula <[email protected]>
(cherry picked from commit 2e88096)

Conflicts:
	LICENSE-binary
(cherry picked from commit 50125e2)
Contributed By: Viraj Jasani
 Conflicts:
	LICENSE-binary

(cherry picked from commit 4f6ebab)
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 5m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ branch-3.3.6 Compile Tests _
-1 ❌ mvninstall 48m 8s /branch-mvninstall-root.txt root in branch-3.3.6 failed.
+1 💚 compile 0m 24s branch-3.3.6 passed
+1 💚 checkstyle 0m 23s branch-3.3.6 passed
+1 💚 mvnsite 0m 28s branch-3.3.6 passed
+1 💚 javadoc 0m 26s branch-3.3.6 passed
+1 💚 spotbugs 0m 42s branch-3.3.6 passed
+1 💚 shadedclient 16m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 24s the patch passed
+1 💚 compile 0m 18s the patch passed
-1 ❌ javac 0m 18s /results-compile-javac-hadoop-tools_hadoop-distcp.txt hadoop-tools_hadoop-distcp generated 1 new + 9 unchanged - 0 fixed = 10 total (was 9)
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 14s /results-checkstyle-hadoop-tools_hadoop-distcp.txt hadoop-tools/hadoop-distcp: The patch generated 106 new + 4 unchanged - 0 fixed = 110 total (was 4)
+1 💚 mvnsite 0m 21s the patch passed
+1 💚 javadoc 0m 16s the patch passed
+1 💚 spotbugs 0m 39s the patch passed
+1 💚 shadedclient 15m 57s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 13m 42s hadoop-distcp in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
107m 3s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/artifact/out/Dockerfile
GITHUB PR #6058
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux 2080079e4014 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.3.6 / e030014
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu118.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/testReport/
Max. process+thread count 552 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/console
versions git=2.17.1 maven=3.6.0 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@AuthurWang2009
Author

@hadoop-yetus
I fixed the Apache Yetus (Jenkins) javac error.

The mvninstall error has nothing to do with my patch; nodejs.version was too low, so I updated it to a higher version.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ branch-3.3.6 Compile Tests _
+0 🆗 mvndep 13m 16s Maven dependency ordering for branch
-1 ❌ mvninstall 14m 53s /branch-mvninstall-root.txt root in branch-3.3.6 failed.
-1 ❌ compile 7m 23s /branch-compile-root.txt root in branch-3.3.6 failed.
+1 💚 checkstyle 1m 33s branch-3.3.6 passed
+1 💚 mvnsite 0m 50s branch-3.3.6 passed
+1 💚 javadoc 0m 45s branch-3.3.6 passed
+0 🆗 spotbugs 0m 27s branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 💚 shadedclient 16m 3s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for patch
+1 💚 mvninstall 0m 36s the patch passed
-1 ❌ compile 7m 12s /patch-compile-root.txt root in the patch failed.
-1 ❌ javac 7m 12s /patch-compile-root.txt root in the patch failed.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 27s /results-checkstyle-root.txt root: The patch generated 106 new + 4 unchanged - 0 fixed = 110 total (was 4)
+1 💚 mvnsite 0m 45s the patch passed
+1 💚 javadoc 0m 39s the patch passed
+0 🆗 spotbugs 0m 19s hadoop-project has no data from spotbugs
+1 💚 shadedclient 15m 59s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 17s hadoop-project in the patch passed.
+1 💚 unit 12m 52s hadoop-distcp in the patch passed.
+1 💚 asflicense 0m 31s The patch does not generate ASF License warnings.
102m 18s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/artifact/out/Dockerfile
GITHUB PR #6058
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle
uname Linux f3207e1b9079 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.3.6 / fafa3c8
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu118.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/testReport/
Max. process+thread count 570 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-tools/hadoop-distcp U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/console
versions git=2.17.1 maven=3.6.0 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.


@steveloughran steveloughran left a comment


Hi,

Thank you for this. You should know that we are all scared of making changes to distCp and will need multiple reviewers on this. It is a critical part of so many workflows, and fairly old code.

Here are the review criteria I use; I'm afraid I am as strict on test code as on production code, and I like tests to stress the failure handling of everything, as that is often the code path which is not exercised in production:

https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md

So: embrace AssertJ and add tests for the failure conditions your code looks for (files not found, etc.).

And we need the distcp documentation updated. Thanks.

return true;
}
} catch (IOException e) {
throw new RuntimeException(e);

RuntimeIOException
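The terse suggestion above is to wrap the IOException in a dedicated unchecked IO exception type instead of a bare RuntimeException. A sketch using the JDK's java.io.UncheckedIOException (the reviewer names RuntimeIOException; the exact type expected in Hadoop may differ, and shouldCopyChecked here is a hypothetical stand-in for the filter's filesystem check):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class WrapDemo {
  // Stand-in for a filter check that touches the filesystem.
  static boolean shouldCopyChecked(String path) throws IOException {
    if (path == null) {
      throw new IOException("no path supplied");
    }
    return true;
  }

  // Wrapping in UncheckedIOException keeps the IOException as the cause,
  // so callers can still unwrap and handle it; a bare RuntimeException
  // loses that type information.
  static boolean shouldCopy(String path) {
    try {
      return shouldCopyChecked(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(shouldCopy("/data/in")); // prints true
  }
}
```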

package org.apache.hadoop.tools;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

Afraid you need to use the same import ordering as everything else; it's a real source of merge pain and we try to stay in control:

java*

non-apache.*

org.apache.*

static *

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DirCopyFilter extends FileStatusCopyFilter {

needs javadocs, and know that our indentation policy is "two spaces"

return shouldCopy(fileStatus.getPath());
}

public boolean supportFileStatus(){

what's this for? add javadoc to explain.

@@ -46,6 +46,14 @@ public void initialize() {}
*/
public abstract boolean shouldCopy(Path path);

public boolean shouldCopy(CopyListingFileStatus fileStatus){

needs javadoc

doubleCheckedTotalSize += currentSplitSize;
}

Assert.assertEquals(totalFileSize, doubleCheckedTotalSize);

Again, AssertJ here and everywhere below.

return size;
}
finally {
IOUtils.cleanupWithLogger(null, fileSystem, outputStream);

don't close the filesystem; it is needed for the other tests

}

@AfterClass
public static void tearDown() {

  1. rename tearDownClass(); move up to just below the setup method
  2. wrap the shutdown with if (cluster != null) {...}
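The two suggestions above can be sketched as plain Java (no JUnit dependency, so the guard itself is runnable; MiniClusterSketch is a hypothetical stand-in for the mini-cluster field in the real test class, with @BeforeClass/@AfterClass annotations implied but omitted):

```java
// MiniClusterSketch stands in for the mini-cluster used by the real test.
class MiniClusterSketch {
  static MiniClusterSketch cluster; // stays null if setup failed
  boolean running = true;

  void shutdown() {
    running = false;
  }

  // Renamed from tearDown() to tearDownClass(); the null guard means a
  // failed setup no longer triggers a NullPointerException on teardown.
  static void tearDownClass() {
    if (cluster != null) {
      cluster.shutdown();
      cluster = null;
    }
  }
}

public class TeardownDemo {
  public static void main(String[] args) {
    // Safe even when setup never ran and cluster is still null.
    MiniClusterSketch.tearDownClass();

    MiniClusterSketch.cluster = new MiniClusterSketch();
    MiniClusterSketch.tearDownClass();
    System.out.println(MiniClusterSketch.cluster == null); // prints true
  }
}
```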

FileSystem fs = sourcePath.getFileSystem(configuration);
FileStatus fileStatus [] = fs.listStatus(sourcePath);
if (fileStatus.length > 1) {
continue;

add a comment to explain

}

//Verify there is nothing more to read from the input file
SequenceFile.Reader reader

or use a try-with-resources for the automatic close
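The try-with-resources suggestion, as a minimal standalone example (BufferedReader over a String stands in for the SequenceFile.Reader over the input file in the actual test):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class FirstLineDemo {
  // try-with-resources closes the reader automatically, even when
  // readLine() throws, replacing an explicit finally/close block.
  static String firstLine(String text) throws IOException {
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      return reader.readLine();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(firstLine("alpha\nbeta")); // prints alpha
  }
}
```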

@AuthurWang2009 AuthurWang2009 changed the base branch from branch-3.3.6 to trunk September 14, 2023 08:39
@AuthurWang2009 AuthurWang2009 marked this pull request as draft September 14, 2023 13:18
@AuthurWang2009
Author

I will open a new pull request and target it at trunk.

@steveloughran
Contributor

Thanks. Link to it from here so I'll see it.

@AuthurWang2009
Author

AuthurWang2009 commented Sep 15, 2023

I made a new PR, 6070. @steveloughran
