
HADOOP-18891 hadoop distcp needs support to filter by file/directory attribute #6058


Closed
wants to merge 1,580 commits into from

Conversation


@AuthurWang2009 AuthurWang2009 commented Sep 13, 2023

Description of PR

In some circumstances, we need to filter files and directories by their attributes, for example by modification time or by whether the entry is a directory.

So, should we introduce a new method public boolean shouldCopy(CopyListingFileStatus fileStatus)?
This approach would give a more flexible way to filter than public abstract boolean shouldCopy(Path path) alone.

To achieve the goal:
1. Add a shouldCopy(CopyListingFileStatus fileStatus) method to the CopyFilter abstract class, together with a supportFileStatus() switch method that returns false by default.
2. Subclasses that implement the abstract class and want to use the new method should override shouldCopy(CopyListingFileStatus fileStatus) and, at the same time, override supportFileStatus() to return true.
3. This change stays compatible with existing use cases.
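The design above can be sketched as follows. This is a minimal, self-contained illustration, not the real Hadoop API: SimpleFileStatus, CopyFilterSketch, and DirOnlyFilterSketch are hypothetical stand-ins for CopyListingFileStatus, CopyFilter, and a filtering subclass.

```java
// SimpleFileStatus is a hypothetical stand-in for Hadoop's
// CopyListingFileStatus; the real class carries far more state.
class SimpleFileStatus {
  final String path;
  final boolean isDirectory;
  final long modificationTime;

  SimpleFileStatus(String path, boolean isDirectory, long modificationTime) {
    this.path = path;
    this.isDirectory = isDirectory;
    this.modificationTime = modificationTime;
  }
}

// Sketch of the proposed CopyFilter shape (not the real Hadoop class).
abstract class CopyFilterSketch {
  // Existing path-based contract.
  public abstract boolean shouldCopy(String path);

  // New, richer hook; the default delegates to the path form,
  // so existing subclasses keep their old behaviour.
  public boolean shouldCopy(SimpleFileStatus status) {
    return shouldCopy(status.path);
  }

  // Switch method: subclasses that override the new hook return true
  // so the copy listing knows which overload to call.
  public boolean supportFileStatus() {
    return false;
  }
}

// Example subclass opting in to the new hook: copy directories only.
class DirOnlyFilterSketch extends CopyFilterSketch {
  @Override
  public boolean shouldCopy(String path) {
    return true; // a bare path cannot tell a directory from a file
  }

  @Override
  public boolean shouldCopy(SimpleFileStatus status) {
    return status.isDirectory;
  }

  @Override
  public boolean supportFileStatus() {
    return true;
  }
}

public class CopyFilterDemo {
  public static void main(String[] args) {
    CopyFilterSketch filter = new DirOnlyFilterSketch();
    SimpleFileStatus dir = new SimpleFileStatus("/data/in", true, 0L);
    SimpleFileStatus file = new SimpleFileStatus("/data/in/part-0", false, 0L);
    System.out.println(filter.supportFileStatus()
        && filter.shouldCopy(dir)
        && !filter.shouldCopy(file)); // prints true
  }
}
```

Because supportFileStatus() defaults to false and the new overload delegates to the path form, existing CopyFilter subclasses compile and behave exactly as before, which is what makes point 3 hold.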

As an implementation:
1. First, create an abstract FileStatusCopyFilter class extending CopyFilter.
2. Then create a DirCopyFilter class extending FileStatusCopyFilter.
3. Finally, implement UniformRecordInputFormat to support DirCopyFilter.

How was this patch tested?

Added unit tests.

Manual testing:
1. Add distcp.filters.class=org.apache.hadoop.tools.DirCopyFilter to distcp-default.xml, or set it with -Ddistcp.filters.class=org.apache.hadoop.tools.DirCopyFilter.
2. Then execute distcp commands.
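As a configuration fragment, the property from step 1 would go into distcp-default.xml roughly like this (sketch; the property name and filter class are taken from the steps above):

```xml
<property>
  <name>distcp.filters.class</name>
  <value>org.apache.hadoop.tools.DirCopyFilter</value>
</property>
```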

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

simbadzina and others added 30 commits August 23, 2022 11:27
…for multiple nameservices.

Fixes apache#4584

Signed-off-by: Owen O'Malley <[email protected]>
* HADOOP-18001. Upgrade jetty version to 9.4.44 (apache#3700). Contributed by Yuan Luo.

Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit b85c66a)

* HADOOP-18333.Upgrade jetty version to 9.4.48.v20220622 (apache#4553)

Co-authored-by: Ashutosh Gupta <[email protected]>
(cherry picked from commit e664f81)

 Conflicts:
	LICENSE-binary

Change-Id: I5a758df2551539c2780e170c3738c5b21eb0c79d

Co-authored-by: better3471 <[email protected]>
Co-authored-by: Ashutosh Gupta <[email protected]>
This addresses an issue where the plugin's default classpath
for executing tests fails to include
org.junit.platform.launcher.core.LauncherFactory.

Contributed by: Steve Vaughan Jr
…che#4728)

Declares its compatibility with Spark's dynamic
output partitioning by having the stream capability
"mapreduce.job.committer.dynamic.partitioning"

Requires a Spark release with SPARK-40034, which
does the probing before deciding whether to
accept/reject instantiation with
dynamic partition overwrite set

This feature can be declared as supported by
any other PathOutputCommitter implementations
whose algorithm and destination filesystem
are compatible.

None of the S3A committers are compatible.

The classic FileOutputCommitter is, but it
does not declare itself as such out of our fear
of changing that code. The Spark-side code
will automatically infer compatibility if
the created committer is of that class or
a subclass.

Contributed by Steve Loughran.
…oxy with multiple connections per user.

Fixes apache#4748

Signed-off-by: Owen O'Malley <[email protected]>
…k report is delayed (apache#4756)

Signed-off-by: Erik Krogen <[email protected]>

(cherry picked from commit 231a446)
The JournalNodeSyncer will include the local instance in syncing when using a bind host (e.g. 0.0.0.0).  There is a mechanism that is supposed to exclude the local instance, but it doesn't recognize the meta-address as a local address.

Running with bind addresses set to 0.0.0.0, the JournalNodeSyncer will log attempts to sync with itself as part of the normal syncing rotation.  For an HA configuration running 3 JournalNodes, the "other" list used by the JournalNodeSyncer will include 3 proxies.

Exclude bound local addresses, including the use of a wildcard address in the bound host configurations, while still allowing multiple instances on the same host.

Allow sync attempts with unresolved addresses, so that sync attempts can drive resolution as servers become available.

Backport.
Signed-off-by: stack <[email protected]>
…ions (apache#4766)


HADOOP-16202 "Enhance openFile()" added asynchronous draining of the
remaining bytes of an S3 HTTP input stream for those operations
(unbuffer, seek) where it could avoid blocking the active
thread.

This patch fixes the asynchronous stream draining to work and so
return the stream back to the http pool. Without this, whenever
unbuffer() or seek() was called on a stream and an asynchronous
drain triggered, the connection was not returned; eventually
the pool would be empty and subsequent S3 requests would
fail with the message "Timeout waiting for connection from pool"

The root cause was that even though the fields passed in to drain() were
converted to references through the methods, in the lambda expression
passed in to submit, they were direct references

operation = client.submit(
 () -> drain(uri, streamStatistics,
       false, reason, remaining,
       object, wrappedStream));  /* here */

Those fields were only read during the async execution, at which
point they would have been set to null (or even a subsequent read).

A new SDKStreamDrainer class performs the draining; this is a Callable
and can be submitted directly to the executor pool.

The class is used in both the classic and prefetching s3a input streams.

Also, calling unbuffer() switches the S3AInputStream from adaptive
to random IO mode; that is, it is considered a cue that future
IO will not be sequential, whole-file reads.

Contributed by Steve Loughran.
part of HADOOP-18103.

Contributed By: Mukund Thakur
…rect buffers (apache#4787)

part of HADOOP-18103.

Contributed By: Mukund Thakur
… to unexpected host resolution (apache#4833)

Use ".invalid" domain from IETF RFC 2606 to ensure that the host doesn't resolve.

Contributed by Steve Vaughan Jr
…writes to disk. (apache#4669)

Follow-up to HADOOP-12020 Support configuration of different S3 storage classes;
S3 storage class is now set when buffering to heap/bytebuffers, and when
creating directory markers

Contributed by Monthon Klongklaew
…onfiguration (apache#4758)

ITestAbfsManifestCommitProtocol  to set requireRenameResilience to false for nonHNS configuration

Contributed by Sree Bhattacharyya
…cs and ITestAbfsRestOperationException (apache#3699)

Successor for the reverted PR apache#3341, using the hadoop @VisibleForTesting attribute

Contributed by Sumangala Patki
…bled. (apache#4862)


part of HADOOP-18103.

While merging the ranges in CheckSumFs, they are rounded up based on the
value of checksum bytes size which leads to some ranges crossing the EOF
thus they need to be fixed else it will cause EOFException during actual reads.

Contributed By: Mukund Thakur
…regation enabled (apache#4703)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit 65a027b)
…lit when reading BZip2 text files (apache#4732)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit 30c36ef)
…tion fails (apache#4896)


The patch provides detailed diagnostics of file creation failure in LocalDirAllocator.

Contributed by: Ashutosh Gupta
…en DN gets a reconstruction task (apache#4901)

(cherry picked from commit 9a29075)
…4909)

This problem surfaced in impala integration tests
   IMPALA-11592. TestLocalCatalogRetries.test_fetch_metadata_retry fails in S3 build
after the change
  HADOOP-17461. Add thread-level IOStatistics Context
The actual GC race condition came with
 HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

The fix for this is, if our hypothesis is correct, in WeakReferenceMap.create()
where a strong reference to the new value is kept in a local variable
*and referred to later* so that the JVM will not GC it.

Along with the fix, extra assertions ensure that if the problem is not fixed,
applications will fail faster/more meaningfully.

Contributed by Steve Loughran.
…ash.moveToAppropriateTrash (apache#4869)

* HADOOP-18444 Add Support for localized trash for ViewFileSystem in Trash.moveToAppropriateTrash

Signed-off-by: Xing Lin <[email protected]>
Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
…ing XML received from an untrusted source (apache#4886)

Co-authored-by: Ashutosh Gupta <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit d9f435f)
Follow on to HADOOP-17461.

Contributed by: Mehakmeet Singh
part of HADOOP-18103.

Also introducing a config fs.s3a.vectored.active.ranged.reads
to configure the maximum number of range reads a
single input stream can have active (downloading, or queued)
to the central FileSystem instance's pool of queued operations.
This stops a single stream overloading the shared thread pool.

Contributed by: Mukund Thakur
 Conflicts:
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
haiyang1987 and others added 19 commits May 21, 2023 09:06
… in DFSAdmin (apache#5667). Contributed by Haiyang Hu.

Reviewed-by: Viraj Jasani <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
…estinationForPath (apache#5276) (apache#5423). Contributed by farmmamba

Reviewed-by: Inigo Goiri <[email protected]>
Reviewed-by: Tao Li <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
…system (apache#5704)

This is a followup to
HADOOP-18724. Open file fails with NumberFormatException for S3AFileSystem

Contributed by Steve Loughran
…d blocks from datanode (apache#5643). Contributed by hfutatzhanghb.

Reviewed-by: Stephen O'Donnell <[email protected]>
Reviewed-by: zhangshuyan <[email protected]>
Signed-off-by: He Xiaoqiao <[email protected]>
(cherry picked from commit 0e6bd09)
…iner path on yarn applications (apache#3908)

Co-authored-by: Monthon Klongklaew <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
…ache#5435) (apache#5729)

This fixes CVE-2022-41881.

This also upgrades io.opencensus dependencies to 0.12.3

Contributed by Aleksandr Nikolaev

(cherry picked from commit 734f7ab)

 Conflicts:
	hadoop-project/pom.xml

Change-Id: I26b8961725706370ac5f0fa248d0b0333034a047

Co-authored-by: nao <[email protected]>
… to reduce impact on heartbeat. (apache#5408). Contributed by ZhangHB.

Signed-off-by: He Xiaoqiao <[email protected]>
This modifies the manifest committer so that the list of files
to rename is passed between stages as a file of
writeable entries on the local filesystem.

The map of directories to create is still passed in memory;
this map is built across all tasks, so even if many tasks
created files, if they all write into the same set of directories
the memory needed is O(directories) with the
task count not a factor.

The _SUCCESS file reports on heap size through gauges.
This should give a warning if there are problems.

Contributed by Steve Loughran
…uted by Dongjoon Hyun.

Reviewed-by: Gautham B A <[email protected]>
Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit fb16e00)

Conflicts:
	hadoop-tools/hadoop-federation-balance/pom.xml
…eksandr Nikolaev.

Signed-off-by: Ayush Saxena <[email protected]>
(cherry picked from commit acf82d4)
(cherry picked from commit b8a76f6)
Signed-off-by: Brahma Reddy Battula <[email protected]>
(cherry picked from commit 2e88096)

Conflicts:
	LICENSE-binary
(cherry picked from commit 50125e2)
Contributed By: Viraj Jasani
 Conflicts:
	LICENSE-binary

(cherry picked from commit 4f6ebab)
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 5m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ branch-3.3.6 Compile Tests _
-1 ❌ mvninstall 48m 8s /branch-mvninstall-root.txt root in branch-3.3.6 failed.
+1 💚 compile 0m 24s branch-3.3.6 passed
+1 💚 checkstyle 0m 23s branch-3.3.6 passed
+1 💚 mvnsite 0m 28s branch-3.3.6 passed
+1 💚 javadoc 0m 26s branch-3.3.6 passed
+1 💚 spotbugs 0m 42s branch-3.3.6 passed
+1 💚 shadedclient 16m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 24s the patch passed
+1 💚 compile 0m 18s the patch passed
-1 ❌ javac 0m 18s /results-compile-javac-hadoop-tools_hadoop-distcp.txt hadoop-tools_hadoop-distcp generated 1 new + 9 unchanged - 0 fixed = 10 total (was 9)
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 14s /results-checkstyle-hadoop-tools_hadoop-distcp.txt hadoop-tools/hadoop-distcp: The patch generated 106 new + 4 unchanged - 0 fixed = 110 total (was 4)
+1 💚 mvnsite 0m 21s the patch passed
+1 💚 javadoc 0m 16s the patch passed
+1 💚 spotbugs 0m 39s the patch passed
+1 💚 shadedclient 15m 57s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 13m 42s hadoop-distcp in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
107m 3s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/artifact/out/Dockerfile
GITHUB PR #6058
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux 2080079e4014 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.3.6 / e030014
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu118.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/testReport/
Max. process+thread count 552 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/1/console
versions git=2.17.1 maven=3.6.0 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@AuthurWang2009
Author

@hadoop-yetus
I fixed the Apache Yetus (Jenkins) javac error.

The mvninstall error has nothing to do with my patch; nodejs.version was too low, so I updated it to a higher version.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ branch-3.3.6 Compile Tests _
+0 🆗 mvndep 13m 16s Maven dependency ordering for branch
-1 ❌ mvninstall 14m 53s /branch-mvninstall-root.txt root in branch-3.3.6 failed.
-1 ❌ compile 7m 23s /branch-compile-root.txt root in branch-3.3.6 failed.
+1 💚 checkstyle 1m 33s branch-3.3.6 passed
+1 💚 mvnsite 0m 50s branch-3.3.6 passed
+1 💚 javadoc 0m 45s branch-3.3.6 passed
+0 🆗 spotbugs 0m 27s branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 💚 shadedclient 16m 3s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for patch
+1 💚 mvninstall 0m 36s the patch passed
-1 ❌ compile 7m 12s /patch-compile-root.txt root in the patch failed.
-1 ❌ javac 7m 12s /patch-compile-root.txt root in the patch failed.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 27s /results-checkstyle-root.txt root: The patch generated 106 new + 4 unchanged - 0 fixed = 110 total (was 4)
+1 💚 mvnsite 0m 45s the patch passed
+1 💚 javadoc 0m 39s the patch passed
+0 🆗 spotbugs 0m 19s hadoop-project has no data from spotbugs
+1 💚 shadedclient 15m 59s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 17s hadoop-project in the patch passed.
+1 💚 unit 12m 52s hadoop-distcp in the patch passed.
+1 💚 asflicense 0m 31s The patch does not generate ASF License warnings.
102m 18s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/artifact/out/Dockerfile
GITHUB PR #6058
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle
uname Linux f3207e1b9079 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.3.6 / fafa3c8
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu118.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/testReport/
Max. process+thread count 570 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-tools/hadoop-distcp U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6058/2/console
versions git=2.17.1 maven=3.6.0 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.


@steveloughran steveloughran left a comment


Hi,

Thank you for this. You should know that we are all scared of making changes to distCp and will need multiple reviewers on this. It is a critical part of so many workflows, and fairly old code.

Here are the review criteria I use; I'm afraid I am as strict on test code as on production code, and I like tests to stress the failure handling of everything, as that is often the code path which is not exercised in production:

https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md

So: embrace AssertJ and add tests for the failure conditions your code looks for (files not found, etc.).

And we need the distcp documentation updated. Thanks.

return true;
}
} catch (IOException e) {
throw new RuntimeException(e);

RuntimeIOException
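The terse suggestion above is to wrap the IOException in a dedicated unchecked IO exception type instead of a bare RuntimeException. A sketch using the JDK's java.io.UncheckedIOException (the reviewer names RuntimeIOException; the exact type expected in Hadoop may differ, and shouldCopyChecked here is a hypothetical stand-in for the filter's filesystem check):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class WrapDemo {
  // Stand-in for a filter check that touches the filesystem.
  static boolean shouldCopyChecked(String path) throws IOException {
    if (path == null) {
      throw new IOException("no path supplied");
    }
    return true;
  }

  // Wrapping in UncheckedIOException keeps the IOException as the cause,
  // so callers can still unwrap and handle it; a bare RuntimeException
  // loses that type information.
  static boolean shouldCopy(String path) {
    try {
      return shouldCopyChecked(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(shouldCopy("/data/in")); // prints true
  }
}
```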

package org.apache.hadoop.tools;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

Afraid you need to use the same import ordering as everything else; it's a real source of merge pain and we try to stay in control:

java*

non-apache.*

org.apache.*

static *

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DirCopyFilter extends FileStatusCopyFilter {

needs javadocs, and know that our indentation policy is "two spaces"

return shouldCopy(fileStatus.getPath());
}

public boolean supportFileStatus(){

what's this for? add javadoc to explain.

@@ -46,6 +46,14 @@ public void initialize() {}
*/
public abstract boolean shouldCopy(Path path);

public boolean shouldCopy(CopyListingFileStatus fileStatus){

needs javadoc

doubleCheckedTotalSize += currentSplitSize;
}

Assert.assertEquals(totalFileSize, doubleCheckedTotalSize);

Again, AssertJ here and everywhere below.

return size;
}
finally {
IOUtils.cleanupWithLogger(null, fileSystem, outputStream);

don't close the filesystem; it is needed for the other tests

}

@AfterClass
public static void tearDown() {

  1. rename tearDownClass(); move up to just below the setup method
  2. wrap the shutdown with if (cluster != null) {...}
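The two suggestions above can be sketched as plain Java (no JUnit dependency, so the guard itself is runnable; MiniClusterSketch is a hypothetical stand-in for the mini-cluster field in the real test class, with @BeforeClass/@AfterClass annotations implied but omitted):

```java
// MiniClusterSketch stands in for the mini-cluster used by the real test.
class MiniClusterSketch {
  static MiniClusterSketch cluster; // stays null if setup failed
  boolean running = true;

  void shutdown() {
    running = false;
  }

  // Renamed from tearDown() to tearDownClass(); the null guard means a
  // failed setup no longer triggers a NullPointerException on teardown.
  static void tearDownClass() {
    if (cluster != null) {
      cluster.shutdown();
      cluster = null;
    }
  }
}

public class TeardownDemo {
  public static void main(String[] args) {
    // Safe even when setup never ran and cluster is still null.
    MiniClusterSketch.tearDownClass();

    MiniClusterSketch.cluster = new MiniClusterSketch();
    MiniClusterSketch.tearDownClass();
    System.out.println(MiniClusterSketch.cluster == null); // prints true
  }
}
```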

FileSystem fs = sourcePath.getFileSystem(configuration);
FileStatus fileStatus [] = fs.listStatus(sourcePath);
if (fileStatus.length > 1) {
continue;

add a comment to explain

}

//Verify there is nothing more to read from the input file
SequenceFile.Reader reader

or use a try-with-resources for the automatic close
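The try-with-resources suggestion, as a minimal standalone example (BufferedReader over a String stands in for the SequenceFile.Reader over the input file in the actual test):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class FirstLineDemo {
  // try-with-resources closes the reader automatically, even when
  // readLine() throws, replacing an explicit finally/close block.
  static String firstLine(String text) throws IOException {
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      return reader.readLine();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(firstLine("alpha\nbeta")); // prints alpha
  }
}
```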

@AuthurWang2009 AuthurWang2009 changed the base branch from branch-3.3.6 to trunk September 14, 2023 08:39
@AuthurWang2009 AuthurWang2009 marked this pull request as draft September 14, 2023 13:18
@AuthurWang2009
Author

I will open a new pull request and target it at trunk.

@steveloughran
Contributor

Thanks. Link to it from here so I'll see it.

@AuthurWang2009
Author

AuthurWang2009 commented Sep 15, 2023

I made a new PR, 6070. @steveloughran
