-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary columns do not receive truncated statistics #5037
Comments
Support for this was only added to the parquet standard 3 weeks ago - apache/parquet-format#216 TBC this will be a breaking API change, as it will break workloads expecting the statistics to not be truncated |
Got it. It looks like the change to enable statistics truncation was done to parquet-mr in 2019, but without the flag: apache/parquet-java#696 So in principle, you couldn’t trust the completeness of binary column stats before then as there was no indication as to whether truncation had occurred or not. |
Hmm... I also note that it is disabled by default, is this still the case? Regardless I think we should probably only perform this in the context of apache/parquet-format#216 as whilst parquet-mr would appear to be configurable to perform binary truncation, I'm fairly confident there are applications that have implicit assumptions that this would break. FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist). |
I'm happy to work up a PR that implements this in the same way, also disabled by default, for parquet-rs. |
Thank you, that would be great |
I think the recent work by @berkaysynnada to add https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36 might be relevant However, I think it is likely we will/should eventually add another variant like
I believe we have a similar usecase in IOx for when we want to ensure the bound includes the actual range, but could be larger (cc @NGA-TRAN ) |
I think so too, adding a range-specifying variant will pave the way for many things. While I have other high-priority tasks to address shortly, I'm always available to offer support if someone wishes to take this on. The variant I have in mind is as follows:
It will also be easier to use after updating intervals (planning to open the related PR in a few days). |
I filed apache/datafusion#8078 with a proposal of a more precise way to represent inexact statistics |
|
Describe the bug
#4389 introduced truncation on column indices for binary columns, where the min/max values for a binary column may be arbitrarily large. As noted, this matches the behaviour in parquet-mr for shortening columns.
However, the value in the statistics is written un-truncated. This differs from the behaviour of parquet-mr where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715
To Reproduce
There is a test in delta-io/delta-rs#1805 which demonstrates this, but in general write a parquet file with a long binary column and observe that the stats for that column are not truncated.
Expected behavior
Matching parquet-mr, the statistics should be truncated as well.
Additional context
Found this when looking into delta-io/delta-rs#1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries.
I think it is sufficient to just call truncate_min_value/truncate_max_value when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 but I don't know enough about the internals of arrow to know if that change is correct.
The text was updated successfully, but these errors were encountered: