Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2492: Add extension points for all thrift messages #254

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alkis
Copy link

@alkis alkis commented May 29, 2024

Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.

ref Parquet Metadata evolution

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 2555aa8 to 5f12691 Compare June 1, 2024 09:34
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch 3 times, most recently from 5ef488c to d187886 Compare June 6, 2024 16:16
@alkis alkis changed the title [T3] Add extension points for all thrift messages [PARQUET-2492] Add extension points for all thrift messages Jun 6, 2024
@alkis alkis changed the title [PARQUET-2492] Add extension points for all thrift messages PARQUET-2492: Add extension points for all thrift messages Jun 6, 2024
@alkis alkis force-pushed the t3-metadata-experimentation branch from d187886 to 056429a Compare June 6, 2024 16:37
@alkis alkis marked this pull request as ready for review June 7, 2024 06:46
@alkis
Copy link
Author

alkis commented Jun 7, 2024

I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original FileMetaData thrift message.

How can we move this forward?

README.md Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 056429a to 7994102 Compare June 12, 2024 05:48
README.md Show resolved Hide resolved
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alkis, I left a few comment.
Minor nit: It is nice to add commits to the PR when addressing comments so that we can see the history of when the comment were added. When amending the same commit and force pushing like here, we can't see the history.

Comment on lines +324 to +333
| Magic | Organization |
|-------|--------------|
| `PAR` | Reserved for the future when an extension replaces `PAR1` |
| `PER` | Reserved for the future when an extension replaces `PARE` |
| `ASF` | Apache |
| `AWS` | Amazon |
| `CDH` | Cloudera |
| `CRM` | Salesforce |
| `DBR` | Databricks |
| `EXP` | Apache/Experimental |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to imply that the extension mechanism is exclusive. We can not have at the same time a new "PAR" metadata format and a proprietary extension.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also,I think we should list here only names that are actually being used and add some information of where they are used and link corresponding documentation. I don't think we need to "reserve" potential future use by a company of their own name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is exclusive because the extension mechanism covers: custom/internal extensions, experimental public extensions, and transition period for new additions to the standard.

I think an example will make this clearer.

The way I see this playing out is:

  1. different vendors add their own extensions in a way that do not interfere with one another. By construction only one extension can be in a file - since a writer is owned by one vendor and will write that vendors extension. The disambiguator string here so that readers that know about one (or more) extensions can check if the extension that is present is one they can read. Some of these extensions will never be made public.
  2. at some point a vendor say NVD wants to make their extension public by including it in the standard. At this point there is a formal proposal to include it in the format as a first class part of metadata.
  3. the extension is reviewed and accepted - possibly in a form slightly different that proposed
  4. the accepted proposal takes magic PAR and is now public. For a transition period the format contains the new format as an extension so that readers have time to be upgraded. This is one of the reasons we pre-reserve PAR now.
  5. once the transition period is over, the extension verbatim becomes the new footer, but now without the thrift field header because it is no longer an extension. At this point we also say that the new magic for the parquet file in this version is PAR\0. This will require very little changes to readers that already supported (4).

Comment on lines +379 to +384
1. A period where the new `FileMetaData` will be written after the old, with a
non-reserved 3 byte magic, say `DBR`.
2. Once the format stabilizes and is considered final, it is brought to the
parquet commitee for ratification.
3. When ratified the extension is moved to an approved state and takes the
reserved 3 byte magic `PAR`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is intended to be a temporary state while the new PAR footer format is being iterated on, maybe we don't need to reserve names like this as "PAR" will be the eventual replacement excluding other names.

It seems to me that this proposal would be better described as a path toward evolving the parquet footer while preserving compatibility for a while. The new footer replacing the existing one eventually.

I think extensibility would be useful as well but those two are not the same thing. By extensibility, I would expect users to have a safe, efficient way to add metadata to the file.

Copy link
Author

@alkis alkis Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explain this in the previous comment. Some extensions will never be made public but they are served by the same mechanism as public extensions in transition to become the new standard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern with this type of proposal, is the "considered final" piece. I think if there is closed development here, some extent the incentives of getting to a new 'PAR' are potentially misaligned.

I like how we are currently iterating in the open on the proposals so everyone has a chance to validate them and provide input. From a community perspective there is obviously nothing that can be done to stop vendors from having custom metadata in the long run, but at least it would be nice to try to provide guidance for the interim process to encourage openness (maybe the table above could also contain an optional link to a public repo/gist so that the community can keep tabs on the different approaches?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants