[T2] Wide column metadata improvements #253

Draft: alkis wants to merge 1 commit into base: master

@alkis alkis commented May 29, 2024

  1. Make ColumnMetaData.type optional
  2. Make ColumnMetaData.path_in_schema optional
  3. Add ColumnMetaData.schema_index: the ordinal of the entry in FileMetaData.schema that this column corresponds to. This allows a sparse representation of columns in a row group.
  4. Deprecate ColumnMetaData.encoding_stats and replace it with ColumnMetaData.is_fully_dict_encoded.
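
To illustrate item 4, here is a minimal Python sketch of how a writer might derive the new flag from the existing per-page statistics. The dataclass and enums are hypothetical stand-ins for the Thrift-generated types (field and enum names follow parquet.thrift), and the rule used here, "every data page uses a dictionary encoding", is an assumption about the intended meaning of is_fully_dict_encoded, not something this PR spells out.

```python
from dataclasses import dataclass
from enum import Enum


# Hypothetical stand-ins for the Thrift-generated enums (subset only).
class PageType(Enum):
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


class Encoding(Enum):
    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3
    RLE_DICTIONARY = 8


@dataclass
class PageEncodingStats:
    page_type: PageType
    encoding: Encoding
    count: int


DICT_ENCODINGS = {Encoding.PLAIN_DICTIONARY, Encoding.RLE_DICTIONARY}
DATA_PAGES = {PageType.DATA_PAGE, PageType.DATA_PAGE_V2}


def is_fully_dict_encoded(stats):
    """Assumed rule: true only if at least one data page exists and every
    data page is dictionary encoded. Dictionary pages are ignored."""
    data = [s for s in stats if s.page_type in DATA_PAGES and s.count > 0]
    return bool(data) and all(s.encoding in DICT_ENCODINGS for s in data)
```

A reader that only wants to know whether dictionary-based filtering is safe could then check a single boolean instead of scanning the encoding_stats list.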

ref Parquet Metadata evolution

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@alkis alkis force-pushed the t2-metadata-improvements branch from 9f5b94e to f0c75b9 Compare May 30, 2024 10:24
* This implies that ColumnMetaData can be sparse in a rowgroup, if for example
* a column does not have any data pages in a rowgroup.
*/
17: optional i32 schema_index;
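
As a rough illustration of the sparse layout described in this comment, the Python sketch below shows how a reader might resolve column chunks by schema ordinal. The ColumnMetaData class is a hypothetical stand-in for the generated type, the helper names are made up for this example, and the sketch assumes every entry in the row group sets schema_index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Hypothetical stand-in; only the fields needed for the sketch.
    schema_index: Optional[int] = None
    num_values: int = 0


def index_row_group(columns):
    """Map schema ordinal -> ColumnMetaData for a possibly sparse row group."""
    by_index = {}
    for meta in columns:
        if meta.schema_index is None:
            # Assumption of this sketch, not a rule stated in the PR.
            raise ValueError("sketch assumes schema_index is set on every entry")
        by_index[meta.schema_index] = meta
    return by_index


def lookup(by_index, schema_index):
    """Return the column's metadata, or None if the row group omits it
    (for example because the column has no data pages there)."""
    return by_index.get(schema_index)
```

With a row group that stores only schema ordinals 1 and 4, `lookup(index_row_group(columns), 2)` returns None, which the reader can treat as "no data pages for this column in this row group".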

FWIW I accidentally discovered https://issues.apache.org/jira/browse/PARQUET-183 which can be fixed with this field.
