Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[T1] Allow fixed length encoding for min/max and deprecate encoding_stats #252

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

alkis
Copy link

@alkis alkis commented May 29, 2024

  1. Add min8/max8 fields for encoding fixed length binary encoding for min/max for physical types less than or equal 8 bytes.
  2. Deprecate ColumnMetaData.encoding_stats and replace with a bool ColumnMetaData.is_fully_dict_encoded'

ref Parquet Metadata evolution

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

* Only one pair of max_value/min_value, max1/min1, max2/min2, max4/min4,
* max8/min8 can be set. The pair is determined by the physical type of the
* column. Floating point values are bitcasted to integers. Variable length
* values are set in min_value/max_value.
Copy link
Contributor

@emkornfield emkornfield May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the docs for readers for backwards compatibility should check min_value/max_value if the non-variable width field is not not set?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten this to be clearer.

*/
5: optional binary max_value;
6: optional binary min_value;
/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
9: optional i64 max8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you intentionally elide min1/max1? (they are still mentioned above).

Copy link
Author

@alkis alkis May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I removed them because they provide little benefit and do not justify the added complexity. This is because in thrift these are ulebs so it makes no difference in the wire. For flatbuffers this would make a difference though.

@@ -810,9 +803,13 @@ struct ColumnMetaData {
/** optional statistics for this column chunk */
12: optional Statistics statistics;

/** Set of all encodings used for pages in this column chunk.
/**
* DEPRECATED: use is_fully_dict_encoded instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest making this a separate PR, I think we'd prefer to keep the changes as small and focused as possible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will create another PR for the Statistics change alone if we are OK merging that now.

…tats

1. Add `min8`/`max8` fields for encoding fixed length binary encoding for min/max for physical types less than or equal 8 bytes.
2. Deprecate `ColumnMetaData.encoding_stats` and replace with a bool `ColumnMetaData.is_fully_dict_encoded`
@alkis alkis force-pushed the t1-metadata-improvements branch from b2caa21 to ec13f34 Compare May 30, 2024 09:27
@alkis alkis changed the title [T1] Allow fixed length encoding for min/max [T1] Allow fixed length encoding for min/max and deprecate encoding_stats May 30, 2024
* the columns ColumnOrder
* max_value/min_value: PLAIN encoded values, sans length prefix if varlen
* max8/min8: up to 8-bytes:
* FLOAT, DOUBLE: bitcasted to INT32 and INT64, respectively
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to be more specific here about values less then 8 bytes are translated into 8 bytes. In practice it doesn't make a difference for readers but it would be good to limit ambiguity. I assume we do a normal cast from 1/4 integer byte values to 8 bytes values rather then just embedding them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants