Change default parquet statistics truncation to be 64 bytes #7578

Open
wants to merge 3 commits into
base: main


alamb
Contributor

@alamb alamb commented May 31, 2025

Which issue does this PR close?

Rationale for this change

Statistics for large columns (e.g. large strings) are typically not useful for min/max value pruning.

However, the current defaults in parquet-rs store the entire min and max values.

For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) are stored in both the file-level metadata and each data page header.
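To make the overhead concrete, here is a rough back-of-the-envelope sketch. The function name `stats_bytes` and the cost model are assumptions for illustration, not parquet-rs code: it assumes one min/max pair in each data page header plus one pair in the column chunk metadata, that both values are about `value_len` bytes, and that the same limit applies in both places.

```rust
/// Hypothetical estimate of the bytes spent on min/max statistics for one
/// column chunk. Not part of parquet-rs; the model is a simplification.
fn stats_bytes(value_len: usize, num_pages: usize, truncate_to: Option<usize>) -> usize {
    // With a limit, each stored value is at most `limit` bytes long.
    let stored = match truncate_to {
        Some(limit) => value_len.min(limit),
        None => value_len,
    };
    // One (min, max) pair per data page header, plus one pair in the
    // column chunk metadata that ends up in the file footer.
    2 * stored * (num_pages + 1)
}
```

Under this model, a chunk with 10 pages of roughly 1 MB values carries about 22 MB of statistics untruncated, versus 1408 bytes with a 64-byte limit.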

What changes are included in this PR?

Change the default statistics truncation length to 64 bytes, matching the default for truncating PageIndex statistics.
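As an illustration of why truncated statistics remain usable for pruning, the following is a minimal byte-level sketch. The names `truncate_min` and `truncate_max` are hypothetical and this is not the parquet-rs implementation; in particular it ignores UTF-8 character boundaries, which a real writer has to respect for string columns.

```rust
/// Truncate a minimum value to at most `limit` bytes.
/// Any prefix sorts <= the original, so it is still a valid lower bound.
fn truncate_min(value: &[u8], limit: usize) -> Vec<u8> {
    value[..value.len().min(limit)].to_vec()
}

/// Truncate a maximum value to at most `limit` bytes.
/// A bare prefix would sort below the original, so the prefix is
/// incremented to keep it a valid upper bound. Returns `None` when no
/// shorter upper bound exists (every byte of the prefix is 0xFF).
fn truncate_max(value: &[u8], limit: usize) -> Option<Vec<u8>> {
    if value.len() <= limit {
        return Some(value.to_vec());
    }
    let mut prefix = value[..limit].to_vec();
    // Drop trailing 0xFF bytes, then bump the last remaining byte.
    while let Some(last) = prefix.pop() {
        if last != 0xFF {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}
```

For example, with a limit of 5 the max value `hello world` becomes `hellp`, which is shorter but still sorts after the original. In the parquet crate the writer-side setting is, to my knowledge, `WriterPropertiesBuilder::set_statistics_truncate_length`, which takes an `Option<usize>`; passing `None` should keep full-length statistics.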

Are there any user-facing changes?

This is a user-facing change -- I expect users will see:

  1. Smaller parquet metadata (and thus smaller parquet files)
  2. Faster load times (as the metadata is smaller)

It is an API change, so we should wait to merge this until the next major release.

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label May 31, 2025
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 31, 2025
@alamb alamb added the api-change Changes to the arrow API label May 31, 2025
Contributor

@etseidl etseidl left a comment

+1

@alamb alamb changed the title Change default statistics truncation to be 64 bytes Change default parquet statistics truncation to be 64 bytes Jun 1, 2025
Development

Successfully merging this pull request may close these issues.

Consider a default max_statistics_truncate_length
3 participants