Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1458: Reduce HDFS NameNode getFileInfo RPC #1767

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Jan 22, 2024

What changes were proposed in this pull request?

If the file in HDFS is in a completed state, avoid calling the HDFS getFileInfo RPC.

Provide orc.file.length.fast configuration to enable this behavior.

Why are the changes needed?

Now reading an ORC file in HDFS will generate at least one open and getFileInfo RPC. This optimization can remove getFileInfo RPC as much as possible, improve ORC reading efficiency, and reduce the number of HDFS RPCs.

How was this patch tested?

The production environment has been running stably for several months

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the JAVA label Jan 22, 2024

public class OrcInputStreamUtil {

private static final String DFS_CLASS = "org.apache.hadoop.hdfs.DFSInputStream";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we reflect this when Apache ORC can access this directly?

$ javap -cp hadoop-client-api-3.3.6.jar org.apache.hadoop.hdfs.DFSInputStream | head -n3
Compiled from "DFSInputStream.java"
public class org.apache.hadoop.hdfs.DFSInputStream extends org.apache.hadoop.fs.FSInputStream implements org.apache.hadoop.fs.ByteBufferReadable,org.apache.hadoop.fs.CanSetDropBehind,org.apache.hadoop.fs.CanSetReadahead,org.apache.hadoop.fs.HasEnhancedByteBufferAccess,org.apache.hadoop.fs.CanUnbuffer,org.apache.hadoop.fs.StreamCapabilities,org.apache.hadoop.fs.ByteBufferPositionedReadable {
  public static boolean tcpReadsDisabledForTesting;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.hadoop.hdfs.DFSInputStream.shortCircuitForbidden Method is not public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before ORC-1430, we could not access hdfs-related classes in the core module, so I used reflection.
Now I have done some refactoring, but the method of shortCircuitForbidden is not public, so I still need to use reflection.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a high-risky way which Apache Hadoop community is also not recommended.

May I ask if you have any reference in ASF communities which adopts this hack, @cxzl25 ?

@dongjoon-hyun
Copy link
Member

Gentle ping, @cxzl25 .

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jan 24, 2024

This is a high-risky way which Apache Hadoop community is also not recommended.

May I ask if you have any reference in ASF communities which adopts this hack

@Hexiaoqiao Can you help review this PR?

@Hexiaoqiao
Copy link

Thanks involve me here. I have left my concerns here, I would like to reassert: I believe this could reduce load to NameNode when using ORC on HDFS but I am worried it involve any potential issues from HDFS view.

Back to this PR, IIUC this is used only for extract footer, right? Technically, for one immutable file, the result from DFSInputStream#getFileLength() and FileStatus#getLen() will be same, but it is not one recommended usage as @dongjoon-hyun mentioned above.

Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants