Support lazy materialization of row groups in ParquetFileReader #2884

Open
asfimport opened this issue Mar 5, 2024 · 3 comments

Comments

@asfimport
Collaborator

Motivation: The current behavior of ParquetFileReader#readNextRowGroup is to eagerly enumerate all chunks in the row group, then read all pages in each chunk. For distributed data workloads, this can cause significant memory pressure, particularly for use cases that require colocating multiple Parquet files on a single worker.

 

Proposal: Add a Parquet Configuration option that enables lazy row group reading, i.e., reading only one page at a time (plus whatever header is necessary to read that page). The option could be either a boolean flag or an int value for how many pages (or page bytes) to buffer at a time.

 

I think this could be accomplished by modifying ParquetFileReader#readAllPages to re-implement pagesInChunk as an Iterator, rather than a List. Then, ColumnChunkPageReader could parse the Configuration option above and decide whether to fully materialize the iterator or not.
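
To make the idea concrete, here is a minimal sketch of a page iterator that buffers a bounded number of pages. It does not use the parquet-mr classes: `Page`, `LazyPageIterator`, and the `maxBuffered` parameter are hypothetical stand-ins, and the `Supplier` is assumed to play the role of "read the next page header and body from the chunk stream".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical stand-in for a decoded page; NOT the parquet-mr DataPage class.
final class Page {
    final int index;

    Page(int index) {
        this.index = index;
    }
}

// Pulls pages on demand and holds at most `maxBuffered` of them in memory.
// `source` is an assumed abstraction over the chunk stream: it returns the
// next page, or null once the chunk is exhausted.
final class LazyPageIterator implements Iterator<Page> {
    private final Supplier<Page> source;
    private final int maxBuffered;
    private final Deque<Page> buffer = new ArrayDeque<>();
    private boolean exhausted = false;

    LazyPageIterator(Supplier<Page> source, int maxBuffered) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be >= 1");
        }
        this.source = source;
        this.maxBuffered = maxBuffered;
    }

    // Refill the buffer up to the configured limit, stopping at end of chunk.
    private void fill() {
        while (!exhausted && buffer.size() < maxBuffered) {
            Page next = source.get();
            if (next == null) {
                exhausted = true;
            } else {
                buffer.addLast(next);
            }
        }
    }

    @Override
    public boolean hasNext() {
        if (buffer.isEmpty()) {
            fill();
        }
        return !buffer.isEmpty();
    }

    @Override
    public Page next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.pollFirst();
    }
}
```

With `maxBuffered = 1` this degenerates to strict page-at-a-time reading. A very large value approximates the current eager behavior, so a single int option could cover both modes.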

 

I'm happy to try to create a draft/branch for this to get some early feedback on the idea!

Reporter: Claire McGinty / @clairemcginty

PRs and other links:

Note: This issue was originally created as PARQUET-2443. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Claire McGinty / @clairemcginty:
I pushed a branch implementing the DataPage-as-Iterator idea, here: master...clairemcginty:parquet-mr:lazy-chunkreader

 

However... it looks like ColumnReaderBase#checkRead works by continually invoking readPage until the row group is fully consumed, so the row group effectively gets materialized there, even if my ColumnChunkPageReader is now backed by a lazy Iterator. Any pointers on how I should modify that code block?

@asfimport
Collaborator Author

Gang Wu / @wgtmac:
I need to take some time to read the related code and get familiar with the context.

 

In the meantime, Apache Iceberg uses an iterator pattern to read values, pages, and even files. It may be helpful to check this out if you haven't: https://github.com/apache/iceberg/blob/2519ab43d654927802cc02e19c917ce90e8e0265/parquet/src/main/java/org/apache/iceberg/parquet/BasePageIterator.java#L40

@asfimport
Collaborator Author

Claire McGinty / @clairemcginty:
Thanks Gang! I'll check out the Iceberg pattern. My implementation uses a Java Iterator, which is a bit tricky because a chunk is a mixture of dictionary and data pages, so there are some awkward workarounds.
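
One possible way to handle the dictionary/data mixture, sketched with hypothetical stand-in types (`PageKind`, `RawPage`, `LazyChunk`) rather than the code on the branch: since the Parquet format places the dictionary page, if present, first in the column chunk, the wrapper can read that one page up front and then expose only data pages through the Iterator.

```java
import java.util.Iterator;
import java.util.Optional;

// Hypothetical page model for illustration; not the parquet-mr page classes.
enum PageKind { DICTIONARY, DATA }

final class RawPage {
    final PageKind kind;
    final String payload;

    RawPage(PageKind kind, String payload) {
        this.kind = kind;
        this.payload = payload;
    }
}

// Wraps a lazy stream of pages for one column chunk. The first page is
// inspected eagerly: a dictionary page is set aside, and everything after
// it is handed out lazily as data pages.
final class LazyChunk implements Iterator<RawPage> {
    private final Iterator<RawPage> pages;
    private final RawPage dictionary; // null when the chunk has no dictionary
    private RawPage pending;          // first page, if it turned out to be a data page

    LazyChunk(Iterator<RawPage> pages) {
        this.pages = pages;
        RawPage first = pages.hasNext() ? pages.next() : null;
        if (first != null && first.kind == PageKind.DICTIONARY) {
            this.dictionary = first;
            this.pending = null;
        } else {
            this.dictionary = null;
            this.pending = first;
        }
    }

    Optional<RawPage> dictionary() {
        return Optional.ofNullable(dictionary);
    }

    @Override
    public boolean hasNext() {
        return pending != null || pages.hasNext();
    }

    @Override
    public RawPage next() {
        if (pending != null) {
            RawPage page = pending;
            pending = null;
            return page;
        }
        // Delegates to the underlying iterator, which throws
        // NoSuchElementException once the chunk is exhausted.
        RawPage page = pages.next();
        if (page.kind != PageKind.DATA) {
            throw new IllegalStateException("dictionary page found after the first page");
        }
        return page;
    }
}
```

This keeps the Iterator's element type uniform (data pages only), at the cost of always reading one page eagerly when the chunk is opened.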

 
