GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

wgtmac · 2023-05-06T02:47:17Z

Rationale for this change

Page-level statistics are probably not used in production, and after adding column indexes they are useless.
parquet-mr already stopped writing them in https://issues.apache.org/jira/browse/PARQUET-1365.

What changes are included in this PR?

Once page index is enabled for one column, it does not write page stats to the header any more.

Are these changes tested?

Added a test to check page stats have been skipped.

Are there any user-facing changes?

Yes (behavior change when page index is enabled).

Closes: [C++][Parquet] Page header does not save statistics once page index is enabled #34375

github-actions · 2023-05-06T02:47:39Z

Closes: [C++][Parquet] Page header does not save statistics once page index is enabled #34375

cpp/src/parquet/column_writer.cc

wgtmac · 2023-05-09T15:19:01Z

Please take a look when you have time, thanks! @wjones127 @emkornfield

cc @fatemehp since this change may affect page filtering based on header stats.

wjones127

I am good with this. Since this is change in behavior, I'll wait for others to chime in before merging.

emkornfield · 2023-05-09T22:13:43Z

Seems OK to me as writing column indices is opt-in?

wgtmac · 2023-05-10T14:23:34Z

Seems OK to me as writing column indices is opt-in?

Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.

wjones127 · 2023-05-10T20:56:22Z

Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.

I think we eventually want to turn it on by default. We just release support for it in 12.0.0, but it's not yet used for any data skipping in the Parquet reader or dataset scanner (right?). I think we should add support for leveraging it in the readers in 13.0.0, and then if it seems to be working for users we can turn in on by default in 14.0.0. Does that seem like a reasonable plan?

wjones127 · 2023-05-10T21:02:37Z

Added some TODOs in this issue: #26168

wgtmac · 2023-05-11T01:46:26Z

I think we should add support for leveraging it in the readers in 13.0.0

Yes, but I don't think the time frame is sufficient for a complete page skipping implementation. We can support it progressively.

wjones127 · 2023-05-11T16:03:15Z

Sure. So I suppose modify my proposal to: if we fully support predicate pushdown for page index in version N, let's set this to default in version N + 1 (assuming there are no unresolved issues we find).

mapleFU

Personally I'm good with this, but maybe some document is required. I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work

About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps apache/parquet-format#196

wgtmac · 2023-05-11T16:19:21Z

I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work

I think the documentation is also missing for stats filtering on the page header. Since page index is an ongoing effort and it is not enabled by default, users will not get confused by this for now.

About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps apache/parquet-format#196

Due to the inactivity in that PR, I assume it will take a long time to move forward as we need somehow to reach a consensus from the community.

wgtmac · 2023-05-24T14:18:00Z

@pitrou Could you take a look as well? Thanks!

pitrou

See comment below. Also the doc should be updated, and perhaps also the WriterProperties docstrings: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties

cpp/src/parquet/column_writer.cc

wgtmac · 2023-06-06T08:28:13Z

Could you please take a look again? @pitrou

pitrou · 2023-06-06T08:56:13Z

cpp/src/parquet/properties.h

@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties {
    /// Enable writing page index in general for all columns. Default disabled.
    ///
    /// Page index contains statistics for data pages and can be used to skip pages
-    /// when scanning data in ordered and unordered columns.
+    /// when scanning data in ordered and unordered columns. Note that it does not


What is "it" in this context? The sentence isn't very clear.

Here "it" denotes to the parquet writer or column writer.

Perhaps:

Writing statistics to the page index disables the old method of writing statistics to each data page header.
The page index makes filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O.

…ex enabled

pitrou · 2023-06-06T09:03:56Z

docs/source/cpp/parquet.rst

 +--------------------------+----------+----------+---------+

-* \(1) Access to the Column and Offset Index structures is provided, but
-  data read APIs do not currently make any use of them.


Why remove this note? AFAIU it is still valid (we do not expose any high-level filtering feature).

pitrou · 2023-06-06T09:05:16Z

Did you try to benchmark writing the page index vs. writing statistics in data page headers?
Perhaps in the future we can enable the page index by default?

wgtmac · 2023-06-06T09:23:09Z

Did you try to benchmark writing the page index vs. writing statistics in data page headers? Perhaps in the future we can enable the page index by default?

I didn't benchmark the writer time because it should spend more time when page index is enabled because:

It only skips writing stats to the thrift-encoded header but the stats comparison (which is heavy) in the column writer still does the job.
Page index builder also does more work than just serializing stats. It also collects sorting order from page stats, collects page offsets and so on.

The main goal of skipping writing stats to page header mainly is to reduce the file size as they are duplicated and easier to get from the column index. We have internally enabled page index by default. The benefit brought by page index is promising.

pitrou · 2023-06-06T09:35:53Z

cpp/src/parquet/properties.h

@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties {
    }

    /// Enable writing page index for column specified by `path`. Default disabled.
+    /// Note that it does not write statistics to the page header once page index is
+    /// enabled.


Update this comment as well? Or just remove it as things are explained in more detail above.

Sorry for missing that! Fixed.

pitrou · 2023-06-06T09:40:45Z

We have internally enabled page index by default. The benefit brought by page index is promising.

Did you measure the benefits (other than file size)?

wgtmac · 2023-06-06T09:56:27Z

Yes, it could achieve 2X~4X acceleration of reading column chunk from cloud object store using page offset index (with I/O coalescing) compared to issue single I/O of a full column chunk (which is currently by default).

pitrou · 2023-06-06T09:57:51Z

Nice, I see. What about the column index, are you using it?

wgtmac · 2023-06-06T10:02:44Z

Nice, I see. What about the column index, are you using it?

Yes, we did a naive implementation of predicate push down similar to what the parquet-mr does.

ursabot · 2023-06-06T16:32:35Z

Benchmark runs are scheduled for baseline = 68cbc6f and contender = 7f8ccb5. 7f8ccb5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 7f8ccb5f ec2-t3-xlarge-us-east-2
[Failed] 7f8ccb5f test-mac-arm
[Failed] 7f8ccb5f ursa-i9-9960x
[Failed] 7f8ccb5f ursa-thinkcentre-m75q
[Failed] 68cbc6fe ec2-t3-xlarge-us-east-2
[Failed] 68cbc6fe test-mac-arm
[Failed] 68cbc6fe ursa-i9-9960x
[Failed] 68cbc6fe ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

fatemehp · 2023-06-14T00:53:24Z

@wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?

wgtmac · 2023-06-14T01:44:35Z

@wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?

In short, collect all offset/length ranges of required pages, then coalesce them into reasonable I/O chunks and issue async reads before reading any page.

wgtmac requested a review from wjones127 as a code owner May 6, 2023 02:47

github-actions bot added Component: C++ Component: Parquet awaiting review Awaiting review labels May 6, 2023

mapleFU reviewed May 6, 2023

View reviewed changes

cpp/src/parquet/column_writer.cc Show resolved Hide resolved

cpp/src/parquet/column_writer.cc Show resolved Hide resolved

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 6, 2023

wgtmac force-pushed the page_stats branch from 042e912 to 9472537 Compare May 8, 2023 14:22

wjones127 approved these changes May 9, 2023

View reviewed changes

github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels May 9, 2023

mapleFU approved these changes May 11, 2023

View reviewed changes

wgtmac force-pushed the page_stats branch from 9472537 to 2afc37d Compare May 24, 2023 07:01

pitrou requested changes May 24, 2023

View reviewed changes

cpp/src/parquet/column_writer.cc Show resolved Hide resolved

github-actions bot added the Component: Documentation label May 26, 2023

wgtmac requested a review from pitrou June 1, 2023 14:42

pitrou reviewed Jun 6, 2023

View reviewed changes

wgtmac added 2 commits June 6, 2023 17:03

apacheGH-34375: [C++][Parquet] Ignore page header stats when page ind…

71db832

…ex enabled

update docstring

9c2f002

pitrou reviewed Jun 6, 2023

View reviewed changes

modify comment and doc

cd7414d

wgtmac force-pushed the page_stats branch from 7d80037 to cd7414d Compare June 6, 2023 09:28

pitrou reviewed Jun 6, 2023

View reviewed changes

remove confusing line

336058c

pitrou approved these changes Jun 6, 2023

View reviewed changes

pitrou merged commit 7f8ccb5 into apache:main Jun 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

wgtmac commented May 6, 2023 •

edited by pitrou

Loading

github-actions bot commented May 6, 2023

wgtmac commented May 9, 2023

wjones127 left a comment

emkornfield commented May 9, 2023

wgtmac commented May 10, 2023

wjones127 commented May 10, 2023

wjones127 commented May 10, 2023

wgtmac commented May 11, 2023

wjones127 commented May 11, 2023

mapleFU left a comment

wgtmac commented May 11, 2023

wgtmac commented May 24, 2023

pitrou left a comment

wgtmac commented Jun 6, 2023

pitrou Jun 6, 2023

wgtmac Jun 6, 2023

pitrou Jun 6, 2023

pitrou Jun 6, 2023

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

pitrou Jun 6, 2023 •

edited

Loading

wgtmac Jun 6, 2023

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

ursabot commented Jun 6, 2023

fatemehp commented Jun 14, 2023

wgtmac commented Jun 14, 2023

GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

Conversation

wgtmac commented May 6, 2023 • edited by pitrou Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot commented May 6, 2023

wgtmac commented May 9, 2023

wjones127 left a comment

Choose a reason for hiding this comment

emkornfield commented May 9, 2023

wgtmac commented May 10, 2023

wjones127 commented May 10, 2023

wjones127 commented May 10, 2023

wgtmac commented May 11, 2023

wjones127 commented May 11, 2023

mapleFU left a comment

Choose a reason for hiding this comment

wgtmac commented May 11, 2023

wgtmac commented May 24, 2023

pitrou left a comment

Choose a reason for hiding this comment

wgtmac commented Jun 6, 2023

pitrou Jun 6, 2023

Choose a reason for hiding this comment

wgtmac Jun 6, 2023

Choose a reason for hiding this comment

pitrou Jun 6, 2023

Choose a reason for hiding this comment

pitrou Jun 6, 2023

Choose a reason for hiding this comment

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

pitrou Jun 6, 2023 • edited Loading

Choose a reason for hiding this comment

wgtmac Jun 6, 2023

Choose a reason for hiding this comment

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

pitrou commented Jun 6, 2023

wgtmac commented Jun 6, 2023

ursabot commented Jun 6, 2023

fatemehp commented Jun 14, 2023

wgtmac commented Jun 14, 2023

wgtmac commented May 6, 2023 •

edited by pitrou

Loading

pitrou Jun 6, 2023 •

edited

Loading