Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34375: [C++][Parquet] Ignore page header stats when page index enabled #35455

Merged
merged 4 commits into from
Jun 6, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented May 6, 2023

Rationale for this change

Page-level statistics are probably not used in production, and after adding column indexes they are useless.
parquet-mr already stopped writing them in https://issues.apache.org/jira/browse/PARQUET-1365.

What changes are included in this PR?

Once page index is enabled for one column, it does not write page stats to the header any more.

Are these changes tested?

Added a test to check page stats have been skipped.

Are there any user-facing changes?

Yes (behavior change when page index is enabled).

@wgtmac wgtmac requested a review from wjones127 as a code owner May 6, 2023 02:47
@github-actions
Copy link

github-actions bot commented May 6, 2023

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 6, 2023
@wgtmac
Copy link
Member Author

wgtmac commented May 9, 2023

Please take a look when you have time, thanks! @wjones127 @emkornfield

cc @fatemehp since this change may affect page filtering based on header stats.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am good with this. Since this is change in behavior, I'll wait for others to chime in before merging.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels May 9, 2023
@emkornfield
Copy link
Contributor

Seems OK to me as writing column indices is opt-in?

@wgtmac
Copy link
Member Author

wgtmac commented May 10, 2023

Seems OK to me as writing column indices is opt-in?

Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.

@wjones127
Copy link
Member

Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.

I think we eventually want to turn it on by default. We just release support for it in 12.0.0, but it's not yet used for any data skipping in the Parquet reader or dataset scanner (right?). I think we should add support for leveraging it in the readers in 13.0.0, and then if it seems to be working for users we can turn in on by default in 14.0.0. Does that seem like a reasonable plan?

@wjones127
Copy link
Member

Added some TODOs in this issue: #26168

@wgtmac
Copy link
Member Author

wgtmac commented May 11, 2023

I think we should add support for leveraging it in the readers in 13.0.0

Yes, but I don't think the time frame is sufficient for a complete page skipping implementation. We can support it progressively.

@wjones127
Copy link
Member

Sure. So I suppose modify my proposal to: if we fully support predicate pushdown for page index in version N, let's set this to default in version N + 1 (assuming there are no unresolved issues we find).

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm good with this, but maybe some document is required. I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work

About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps apache/parquet-format#196

@wgtmac
Copy link
Member Author

wgtmac commented May 11, 2023

I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work

I think the documentation is also missing for stats filtering on the page header. Since page index is an ongoing effort and it is not enabled by default, users will not get confused by this for now.

About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps apache/parquet-format#196

Due to the inactivity in that PR, I assume it will take a long time to move forward as we need somehow to reach a consensus from the community.

@wgtmac
Copy link
Member Author

wgtmac commented May 24, 2023

@pitrou Could you take a look as well? Thanks!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below. Also the doc should be updated, and perhaps also the WriterProperties docstrings: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties

cpp/src/parquet/column_writer.cc Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Jun 6, 2023

Could you please take a look again? @pitrou

@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties {
/// Enable writing page index in general for all columns. Default disabled.
///
/// Page index contains statistics for data pages and can be used to skip pages
/// when scanning data in ordered and unordered columns.
/// when scanning data in ordered and unordered columns. Note that it does not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "it" in this context? The sentence isn't very clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here "it" denotes to the parquet writer or column writer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps:

Writing statistics to the page index disables the old method of writing statistics to each data page header.
The page index makes filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O.

+--------------------------+----------+----------+---------+

* \(1) Access to the Column and Offset Index structures is provided, but
data read APIs do not currently make any use of them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this note? AFAIU it is still valid (we do not expose any high-level filtering feature).

@pitrou
Copy link
Member

pitrou commented Jun 6, 2023

Did you try to benchmark writing the page index vs. writing statistics in data page headers?
Perhaps in the future we can enable the page index by default?

@wgtmac
Copy link
Member Author

wgtmac commented Jun 6, 2023

Did you try to benchmark writing the page index vs. writing statistics in data page headers? Perhaps in the future we can enable the page index by default?

I didn't benchmark the writer time because it should spend more time when page index is enabled because:

  • It only skips writing stats to the thrift-encoded header but the stats comparison (which is heavy) in the column writer still does the job.
  • Page index builder also does more work than just serializing stats. It also collects sorting order from page stats, collects page offsets and so on.

The main goal of skipping writing stats to page header mainly is to reduce the file size as they are duplicated and easier to get from the column index. We have internally enabled page index by default. The benefit brought by page index is promising.

@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties {
}

/// Enable writing page index for column specified by `path`. Default disabled.
/// Note that it does not write statistics to the page header once page index is
/// enabled.
Copy link
Member

@pitrou pitrou Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this comment as well? Or just remove it as things are explained in more detail above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing that! Fixed.

@pitrou
Copy link
Member

pitrou commented Jun 6, 2023

We have internally enabled page index by default. The benefit brought by page index is promising.

Did you measure the benefits (other than file size)?

@wgtmac
Copy link
Member Author

wgtmac commented Jun 6, 2023

Yes, it could achieve 2X~4X acceleration of reading column chunk from cloud object store using page offset index (with I/O coalescing) compared to issue single I/O of a full column chunk (which is currently by default).

@pitrou
Copy link
Member

pitrou commented Jun 6, 2023

Nice, I see. What about the column index, are you using it?

@wgtmac
Copy link
Member Author

wgtmac commented Jun 6, 2023

Nice, I see. What about the column index, are you using it?

Yes, we did a naive implementation of predicate push down similar to what the parquet-mr does.

@pitrou pitrou merged commit 7f8ccb5 into apache:main Jun 6, 2023
@ursabot
Copy link

ursabot commented Jun 6, 2023

Benchmark runs are scheduled for baseline = 68cbc6f and contender = 7f8ccb5. 7f8ccb5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 7f8ccb5f ec2-t3-xlarge-us-east-2
[Failed] 7f8ccb5f test-mac-arm
[Failed] 7f8ccb5f ursa-i9-9960x
[Failed] 7f8ccb5f ursa-thinkcentre-m75q
[Failed] 68cbc6fe ec2-t3-xlarge-us-east-2
[Failed] 68cbc6fe test-mac-arm
[Failed] 68cbc6fe ursa-i9-9960x
[Failed] 68cbc6fe ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@fatemehp
Copy link
Contributor

@wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?

@wgtmac
Copy link
Member Author

wgtmac commented Jun 14, 2023

@wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?

In short, collect all offset/length ranges of required pages, then coalesce them into reasonable I/O chunks and issue async reads before reading any page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Page header does not save statistics once page index is enabled
7 participants