HADOOP-18891 hadoop distcp needs support to filter by file/directory attribute #6058
Conversation
…for multiple nameservices. Fixes apache#4584 Signed-off-by: Owen O'Malley <[email protected]>
* HADOOP-18001. Upgrade jetty version to 9.4.44 (apache#3700). Contributed by Yuan Luo. Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit b85c66a) * HADOOP-18333.Upgrade jetty version to 9.4.48.v20220622 (apache#4553) Co-authored-by: Ashutosh Gupta <[email protected]> (cherry picked from commit e664f81) Conflicts: LICENSE-binary Change-Id: I5a758df2551539c2780e170c3738c5b21eb0c79d Co-authored-by: better3471 <[email protected]> Co-authored-by: Ashutosh Gupta <[email protected]>
This addresses an issue where the plugin's default classpath for executing tests fails to include org.junit.platform.launcher.core.LauncherFactory. Contributed by: Steve Vaughan Jr
…che#4728) Declares its compatibility with Spark's dynamic output partitioning by having the stream capability "mapreduce.job.committer.dynamic.partitioning" Requires a Spark release with SPARK-40034, which does the probing before deciding whether to accept/rejecting instantiation with dynamic partition overwrite set This feature can be declared as supported by any other PathOutputCommitter implementations whose algorithm and destination filesystem are compatible. None of the S3A committers are compatible. The classic FileOutputCommitter is, but it does not declare itself as such out of our fear of changing that code. The Spark-side code will automatically infer compatibility if the created committer is of that class or a subclass. Contributed by Steve Loughran.
This reverts commit 1ff1210.
…oxy with multiple connections per user. Fixes apache#4748 Signed-off-by: Owen O'Malley <[email protected]>
…k report is delayed (apache#4756) Signed-off-by: Erik Krogen <[email protected]> (cherry picked from commit 231a446)
The JournalNodeSyncer will include the local instance in syncing when using a bind host (e.g. 0.0.0.0). There is a mechanism that is supposed to exclude the local instance, but it doesn't recognize the meta-address as a local address. Running with bind addresses set to 0.0.0.0, the JournalNodeSyncer will log attempts to sync with itself as part of the normal syncing rotation. For an HA configuration running 3 JournalNodes, the "other" list used by the JournalNodeSyncer will include 3 proxies. Exclude bound local addresses, including the use of a wildcard address in the bound host configurations, while still allowing multiple instances on the same host. Allow sync attempts with unresolved addresses, so that sync attempts can drive resolution as servers become available. Backport. Signed-off-by: stack <[email protected]>
…ions (apache#4766) HADOOP-16202 "Enhance openFile()" added asynchronous draining of the remaining bytes of an S3 HTTP input stream for those operations (unbuffer, seek) where it could avoid blocking the active thread. This patch fixes the asynchronous stream draining to work and so return the stream back to the http pool. Without this, whenever unbuffer() or seek() was called on a stream and an asynchronous drain triggered, the connection was not returned; eventually the pool would be empty and subsequent S3 requests would fail with the message "Timeout waiting for connection from pool" The root cause was that even though the fields passed in to drain() were converted to references through the methods, in the lambda expression passed in to submit, they were direct references operation = client.submit( () -> drain(uri, streamStatistics, false, reason, remaining, object, wrappedStream)); /* here */ Those fields were only read during the async execution, at which point they would have been set to null (or even a subsequent read). A new SDKStreamDrainer class peforms the draining; this is a Callable and can be submitted directly to the executor pool. The class is used in both the classic and prefetching s3a input streams. Also, calling unbuffer() switches the S3AInputStream from adaptive to random IO mode; that is, it is considered a cue that future IO will not be sequential, whole-file reads. Contributed by Steve Loughran.
part of HADOOP-18103. Contributed By: Mukund Thakur
…rect buffers (apache#4787) part of HADOOP-18103. Contributed By: Mukund Thakur
… to unexpected host resolution (apache#4833) Use ".invalid" domain from IETF RFC 2606 to ensure that the host doesn't resolve. Contributed by Steve Vaughan Jr
…writes to disk. (apache#4669) Follow-up to HADOOP-12020 Support configuration of different S3 storage classes; S3 storage class is now set when buffering to heap/bytebuffers, and when creating directory markers Contributed by Monthon Klongklaew
…onfiguration (apache#4758) ITestAbfsManifestCommitProtocol to set requireRenameResilience to false for nonHNS configuration Contributed by Sree Bhattacharyya
…cs and ITestAbfsRestOperationException (apache#3699) Successor for the reverted PR apache#3341, using the hadoop @VisibleForTesting attribute Contributed by Sumangala Patki
…bled. (apache#4862) part of HADOOP-18103. While merging the ranges in CheckSumFs, they are rounded up based on the value of checksum bytes size which leads to some ranges crossing the EOF thus they need to be fixed else it will cause EOFException during actual reads. Contributed By: Mukund Thakur
…regation enabled (apache#4703) Co-authored-by: Ashutosh Gupta <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]> (cherry picked from commit 65a027b)
…lit when reading BZip2 text files (apache#4732) Co-authored-by: Ashutosh Gupta <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]> (cherry picked from commit 30c36ef)
…tion fails (apache#4896) The patch provides detailed diagnostics of file creation failure in LocalDirAllocator. Contributed by: Ashutosh Gupta
…en DN gets a reconstruction task (apache#4901) (cherry picked from commit 9a29075)
…4909) This problem surfaced in impala integration tests IMPALA-11592. TestLocalCatalogRetries.test_fetch_metadata_retry fails in S3 build after the change HADOOP-17461. Add thread-level IOStatistics Context The actual GC race condition came with HADOOP-18091. S3A auditing leaks memory through ThreadLocal references The fix for this is, if our hypothesis is correct, in WeakReferenceMap.create() where a strong reference to the new value is kept in a local variable *and referred to later* so that the JVM will not GC it. Along with the fix, extra assertions ensure that if the problem is not fixed, applications will fail faster/more meaningfully. Contributed by Steve Loughran.
…ash.moveToAppropriateTrash (apache#4869) * HADOOP-18444 Add Support for localized trash for ViewFileSystem in Trash.moveToAppropriateTrash Signed-off-by: Xing Lin <[email protected]>
Co-authored-by: Ashutosh Gupta <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]>
…ing XML received from an untrusted source (apache#4886) Co-authored-by: Ashutosh Gupta <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]> (cherry picked from commit d9f435f)
Follow on to HADOOP-17461. Contributed by: Mehakmeet Singh
part of HADOOP-18103. Also introducing a config fs.s3a.vectored.active.ranged.reads to configure the maximum number of number of range reads a single input stream can have active (downloading, or queued) to the central FileSystem instance's pool of queued operations. This stops a single stream overloading the shared thread pool. Contributed by: Mukund Thakur Conflicts: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
… in DFSAdmin (apache#5667). Contributed by Haiyang Hu. Reviewed-by: Viraj Jasani <[email protected]> Signed-off-by: Ayush Saxena <[email protected]>
…estinationForPath (apache#5276) (apache#5423). Contributed by farmmamba Reviewed-by: Inigo Goiri <[email protected]> Reviewed-by: Tao Li <[email protected]> Signed-off-by: Ayush Saxena <[email protected]>
…system (apache#5704) This is a followup to HADOOP-18724. Open file fails with NumberFormatException for S3AFileSystem Contributed by Steve Loughran
(cherry picked from commit f68f1a4) Co-authored-by: zhengchenyu <[email protected]>
…d blocks from datanode (apache#5643). Contributed by hfutatzhanghb. Reviewed-by: Stephen O'Donnell <[email protected]> Reviewed-by: zhangshuyan <[email protected]> Signed-off-by: He Xiaoqiao <[email protected]> (cherry picked from commit 0e6bd09)
apache#5675) Contributed by Viraj Jasani
…iner path on yarn applications (apache#3908) Co-authored-by: Monthon Klongklaew <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]>
…ache#5435) (apache#5729) This fixes CVE-2022-41881. This also upgrades io.opencensus dependencies to 0.12.3 Contributed by Aleksandr Nikolaev (cherry picked from commit 734f7ab) Conflicts: hadoop-project/pom.xml Change-Id: I26b8961725706370ac5f0fa248d0b0333034a047 Co-authored-by: nao <[email protected]>
… to reduce impact on heartbeat. (apache#5408). Contributed by ZhangHB. Signed-off-by: He Xiaoqiao <[email protected]>
This modifies the manifest committer so that the list of files to rename is passed between stages as a file of writeable entries on the local filesystem. The map of directories to create is still passed in memory; this map is built across all tasks, so even if many tasks created files, if they all write into the same set of directories the memory needed is O(directories) with the task count not a factor. The _SUCCESS file reports on heap size through gauges. This should give a warning if there are problems. Contributed by Steve Loughran
…uted by Dongjoon Hyun. Reviewed-by: Gautham B A <[email protected]> Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit fb16e00) Conflicts: hadoop-tools/hadoop-federation-balance/pom.xml
(cherry picked from commit e1bb4ac)
…eksandr Nikolaev. Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit acf82d4) (cherry picked from commit b8a76f6)
Signed-off-by: Brahma Reddy Battula <[email protected]> (cherry picked from commit 2e88096) Conflicts: LICENSE-binary (cherry picked from commit 50125e2)
Contributed By: Viraj Jasani Conflicts: LICENSE-binary (cherry picked from commit 4f6ebab)
💔 -1 overall
This message was automatically generated.
@hadoop-yetus the mvninstall error has nothing to do with my patch; nodejs.version is too low, so I updated it to a higher version.
…nd update it to latest stable version
💔 -1 overall
This message was automatically generated.
Hi,
Thank you for this. You should know that we are all scared of making changes to DistCp and will need multiple reviewers on this. It is a critical part of so many workflows, and fairly old code.
Here are the review criteria I use to review things; afraid I am as strict on test code as on production code, and like tests to stress the failure handling of everything, as that's often the code path which is not tested in production.
https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md
So: embrace AssertJ and add tests for the failure conditions your code looks for (files not found, etc.).
And we need the DistCp documentation updated. Thanks.
        return true;
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
RuntimeIOException
package org.apache.hadoop.tools;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
afraid you need to use the same import ordering as everything else - it's a real source of merge pain and we try to stay in control. The grouping (sketched below) is:
java*
non-apache.*
org.apache.*
static *
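For example, under that grouping the imports at the top of this class would be ordered roughly like this (illustrative only, not the actual patch):

    // java.* first
    import java.io.IOException;

    // then third-party, non-apache packages
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // then org.apache.*
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // static imports last
    import static org.junit.Assert.assertEquals;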
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DirCopyFilter extends FileStatusCopyFilter {
needs javadocs, and know that our indentation policy is "two spaces"
    return shouldCopy(fileStatus.getPath());
  }

  public boolean supportFileStatus(){
what's this for? add javadoc to explain.
@@ -46,6 +46,14 @@ public void initialize() {}
   */
  public abstract boolean shouldCopy(Path path);

  public boolean shouldCopy(CopyListingFileStatus fileStatus){
needs javadoc
      doubleCheckedTotalSize += currentSplitSize;
    }

    Assert.assertEquals(totalFileSize, doubleCheckedTotalSize);
again, AssertJ here and everywhere below
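As a rough, fragment-only illustration (not part of the patch), the JUnit assertion above could be rewritten with AssertJ along these lines:

    // AssertJ gives a descriptive failure message via describedAs()
    import static org.assertj.core.api.Assertions.assertThat;

    assertThat(doubleCheckedTotalSize)
        .describedAs("sum of the individual split sizes")
        .isEqualTo(totalFileSize);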
      return size;
    }
    finally {
      IOUtils.cleanupWithLogger(null, fileSystem, outputStream);
don't close the filesystem; it is needed for the other tests
  }

  @AfterClass
  public static void tearDown() {
- rename to tearDownClass(); move up to just below the setup method
- wrap the shutdown with a null check (see the sketch below):
if (cluster != null) {...}
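Putting the two suggestions together, the teardown might look roughly like this (the cluster field name is assumed from the test class, not taken from the patch):

    @AfterClass
    public static void tearDownClass() {
      // guard against setup having failed before the cluster was created
      if (cluster != null) {
        cluster.shutdown();
      }
    }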
    FileSystem fs = sourcePath.getFileSystem(configuration);
    FileStatus fileStatus [] = fs.listStatus(sourcePath);
    if (fileStatus.length > 1) {
      continue;
add a comment to explain
    }

    //Verify there is nothing more to read from the input file
    SequenceFile.Reader reader
or use a try-with-resources for the automatic close
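A minimal sketch of that try-with-resources form, assuming a conf and a listing file path named listingFile from the surrounding test:

    // the reader is closed automatically when the block exits, even if an assertion fails
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(listingFile))) {
      // ... verify there is nothing more to read from the input file ...
    }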
I will reopen a pull request, and merge it to trunk.
thanks. link to it from here so I'll see it
I made a new PR, 6070, @steveloughran
Description of PR
In some circumstances, we need to filter files/directories by their attributes. For example, we may need to filter them out by modification time, the isDir attribute, etc.
So, should we introduce a new method public boolean shouldCopy(CopyListingFileStatus fileStatus)?
With this approach, we get a more flexible way to filter than public abstract boolean shouldCopy(Path path).
To achieve the goal:
1. Create a shouldCopy(CopyListingFileStatus fileStatus) method in the abstract CopyFilter class, together with a supportFileStatus() switch method that returns false by default (sketched below).
2. Subclasses that implement the abstract class and want to use the new method should override shouldCopy(CopyListingFileStatus fileStatus) and, at the same time, override supportFileStatus() to return true.
3. This change is compatible with the old use case.
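A rough sketch of what those additions to CopyFilter could look like, based on the description above and the diff fragments in this review (not necessarily the exact patch):

    /**
     * Whether this filter understands the FileStatus-based callback.
     * Existing path-only filters keep the default of false.
     */
    public boolean supportFileStatus() {
      return false;
    }

    /**
     * Filter on the full CopyListingFileStatus; the default simply
     * delegates to the existing path-based check.
     */
    public boolean shouldCopy(CopyListingFileStatus fileStatus) {
      return shouldCopy(fileStatus.getPath());
    }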
As an implementation:
1. First create an abstract FileStatusCopyFilter that extends CopyFilter.
2. Then create a DirCopyFilter class that extends FileStatusCopyFilter (sketched below).
3. And implement UniformRecordInputFormat to support DirCopyFilter.
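An illustrative FileStatusCopyFilter subclass along those lines (the filtering rule shown is assumed for the example, not taken from the patch):

    public class DirCopyFilter extends FileStatusCopyFilter {

      @Override
      public boolean supportFileStatus() {
        return true;  // opt in to FileStatus-based filtering
      }

      @Override
      public boolean shouldCopy(CopyListingFileStatus fileStatus) {
        return fileStatus.isDirectory();  // e.g. keep only directory entries
      }

      @Override
      public boolean shouldCopy(Path path) {
        return true;  // path-only callers see no filtering
      }
    }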
How was this patch tested?
Added unit tests.
1. Add distcp.filters.class=org.apache.hadoop.tools.DirCopyFilter to distcp-default.xml, or set it with -Ddistcp.filters.class=org.apache.hadoop.tools.DirCopyFilter.
2. Then execute distcp commands (example below).
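For example (source and destination paths are placeholders):

    hadoop distcp \
      -Ddistcp.filters.class=org.apache.hadoop.tools.DirCopyFilter \
      hdfs://nn1/src hdfs://nn2/dst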
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?