-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455
Conversation
Please take a look when you have time, thanks! @wjones127 @emkornfield cc @fatemehp since this change may affect page filtering based on header stats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am good with this. Since this is change in behavior, I'll wait for others to chime in before merging.
Seems OK to me as writing column indices is opt-in? |
Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does. |
I think we eventually want to turn it on by default. We just release support for it in 12.0.0, but it's not yet used for any data skipping in the Parquet reader or dataset scanner (right?). I think we should add support for leveraging it in the readers in 13.0.0, and then if it seems to be working for users we can turn in on by default in 14.0.0. Does that seem like a reasonable plan? |
Added some TODOs in this issue: #26168 |
Yes, but I don't think the time frame is sufficient for a complete page skipping implementation. We can support it progressively. |
Sure. So I suppose modify my proposal to: if we fully support predicate pushdown for page index in version N, let's set this to default in version N + 1 (assuming there are no unresolved issues we find). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'm good with this, but maybe some document is required. I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work
About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps apache/parquet-format#196
I think the documentation is also missing for stats filtering on the page header. Since page index is an ongoing effort and it is not enabled by default, users will not get confused by this for now.
Due to the inactivity in that PR, I assume it will take a long time to move forward as we need somehow to reach a consensus from the community. |
@pitrou Could you take a look as well? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below. Also the doc should be updated, and perhaps also the WriterProperties
docstrings: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties
Could you please take a look again? @pitrou |
cpp/src/parquet/properties.h
Outdated
@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties { | |||
/// Enable writing page index in general for all columns. Default disabled. | |||
/// | |||
/// Page index contains statistics for data pages and can be used to skip pages | |||
/// when scanning data in ordered and unordered columns. | |||
/// when scanning data in ordered and unordered columns. Note that it does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "it" in this context? The sentence isn't very clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here "it" denotes to the parquet writer or column writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps:
Writing statistics to the page index disables the old method of writing statistics to each data page header.
The page index makes filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O.
+--------------------------+----------+----------+---------+ | ||
|
||
* \(1) Access to the Column and Offset Index structures is provided, but | ||
data read APIs do not currently make any use of them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this note? AFAIU it is still valid (we do not expose any high-level filtering feature).
Did you try to benchmark writing the page index vs. writing statistics in data page headers? |
I didn't benchmark the writer time because it should spend more time when page index is enabled because:
The main goal of skipping writing stats to page header mainly is to reduce the file size as they are duplicated and easier to get from the column index. We have internally enabled page index by default. The benefit brought by page index is promising. |
cpp/src/parquet/properties.h
Outdated
@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties { | |||
} | |||
|
|||
/// Enable writing page index for column specified by `path`. Default disabled. | |||
/// Note that it does not write statistics to the page header once page index is | |||
/// enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this comment as well? Or just remove it as things are explained in more detail above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for missing that! Fixed.
Did you measure the benefits (other than file size)? |
Yes, it could achieve 2X~4X acceleration of reading column chunk from cloud object store using page offset index (with I/O coalescing) compared to issue single I/O of a full column chunk (which is currently by default). |
Nice, I see. What about the column index, are you using it? |
Yes, we did a naive implementation of predicate push down similar to what the parquet-mr does. |
Benchmark runs are scheduled for baseline = 68cbc6f and contender = 7f8ccb5. 7f8ccb5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
@wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page? |
In short, collect all offset/length ranges of required pages, then coalesce them into reasonable I/O chunks and issue async reads before reading any page. |
Rationale for this change
Page-level statistics are probably not used in production, and after adding column indexes they are useless.
parquet-mr already stopped writing them in https://issues.apache.org/jira/browse/PARQUET-1365.
What changes are included in this PR?
Once page index is enabled for one column, it does not write page stats to the header any more.
Are these changes tested?
Added a test to check page stats have been skipped.
Are there any user-facing changes?
Yes (behavior change when page index is enabled).