HadoopInputFile to pass down FileStatus when opening file. #2915

Open
asfimport opened this issue Jun 8, 2024 · 0 comments

asfimport commented Jun 8, 2024

In the current version of the HadoopInputFile implementation:

https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java

 

When newStream() is called, the FileStatus that was already fetched to create the HadoopInputFile is not passed on. As a result, each FileSystem implementation will most likely have to request it again when opening the file, because it needs to know whether the file exists, how large it is, and other information relevant to opening it.

Hadoop's openFile() builder API does support this, but it is not available on older releases, so Parquet cannot use the API until it moves to Hadoop 3.2.0+ only. And because it's a complex and extensible design, it is very hard to use through reflection.

HADOOP-19131 adds reflection-friendly entry points for this and other operations, so for releases with the new class, Parquet can pick up the speedup.
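
To make the cost concrete, here is a minimal toy model of the redundant lookup, not the Parquet or Hadoop code. Every type in it (Status, CountingStore, lookupsFor) is hypothetical and only stands in for FileStatus, a FileSystem implementation and the HadoopInputFile flow. It counts how many metadata lookups happen when the status fetched at creation time is, or is not, handed to the open call.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are Hadoop or Parquet classes.
class StatusReuseSketch {

    /** Hypothetical stand-in for a file status: just the path and length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical store that counts metadata lookups (think: one remote call each). */
    static final class CountingStore {
        private final Map<String, Long> files = new HashMap<>();
        private int statusLookups = 0;

        void put(String path, long length) {
            files.put(path, length);
        }

        Status getStatus(String path) {
            statusLookups++;
            Long length = files.get(path);
            if (length == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return new Status(path, length);
        }

        // Open without a known status: the store has to look it up itself.
        long open(String path) {
            return getStatus(path).length;
        }

        // Open with a caller-supplied status: no extra lookup is needed.
        long open(String path, Status known) {
            return known.length;
        }

        int statusLookups() {
            return statusLookups;
        }
    }

    /** Fetches the status once (as when the input file is created), then opens the file. */
    static int lookupsFor(boolean passStatus) {
        CountingStore store = new CountingStore();
        store.put("data.parquet", 1024L);
        Status status = store.getStatus("data.parquet");
        if (passStatus) {
            store.open("data.parquet", status);
        } else {
            store.open("data.parquet");
        }
        return store.statusLookups();
    }

    public static void main(String[] args) {
        System.out.println("without status: " + lookupsFor(false) + " lookups");
        System.out.println("with status: " + lookupsFor(true) + " lookups");
    }
}
```

In this sketch the open call that receives the status performs one lookup in total, while the one that does not performs two. On a remote object store each extra lookup would presumably be a separate network request, which is the saving this issue is after.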

Reporter: Oliver Caballero Alvarez

Note: This issue was originally created as PARQUET-2493. Please see the migration documentation for further details.
