-
Notifications
You must be signed in to change notification settings - Fork 422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2492: Add extension points for all thrift messages #254
base: master
Are you sure you want to change the base?
Conversation
b66b5a8
to
2555aa8
Compare
2555aa8
to
5f12691
Compare
5ef488c
to
d187886
Compare
d187886
to
056429a
Compare
I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original How can we move this forward? |
056429a
to
7994102
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alkis, I left a few comment.
Minor nit: It is nice to add commits to the PR when addressing comments so that we can see the history of when the comment were added. When amending the same commit and force pushing like here, we can't see the history.
| Magic | Organization | | ||
|-------|--------------| | ||
| `PAR` | Reserved for the future when an extension replaces `PAR1` | | ||
| `PER` | Reserved for the future when an extension replaces `PARE` | | ||
| `ASF` | Apache | | ||
| `AWS` | Amazon | | ||
| `CDH` | Cloudera | | ||
| `CRM` | Salesforce | | ||
| `DBR` | Databricks | | ||
| `EXP` | Apache/Experimental | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to imply that the extension mechanism is exclusive. We can not have at the same time a new "PAR" metadata format and a proprietary extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also,I think we should list here only names that are actually being used and add some information of where they are used and link corresponding documentation. I don't think we need to "reserve" potential future use by a company of their own name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is exclusive because the extension mechanism covers: custom/internal extensions, experimental public extensions, and transition period for new additions to the standard.
I think an example will make this clearer.
The way I see this playing out is:
- different vendors add their own extensions in a way that do not interfere with one another. By construction only one extension can be in a file - since a writer is owned by one vendor and will write that vendors extension. The disambiguator string here so that readers that know about one (or more) extensions can check if the extension that is present is one they can read. Some of these extensions will never be made public.
- at some point a vendor say
NVD
wants to make their extension public by including it in the standard. At this point there is a formal proposal to include it in the format as a first class part of metadata. - the extension is reviewed and accepted - possibly in a form slightly different that proposed
- the accepted proposal takes magic
PAR
and is now public. For a transition period the format contains the new format as an extension so that readers have time to be upgraded. This is one of the reasons we pre-reservePAR
now. - once the transition period is over, the extension verbatim becomes the new footer, but now without the thrift field header because it is no longer an extension. At this point we also say that the new magic for the parquet file in this version is
PAR\0
. This will require very little changes to readers that already supported (4).
1. A period where the new `FileMetaData` will be written after the old, with a | ||
non-reserved 3 byte magic, say `DBR`. | ||
2. Once the format stabilizes and is considered final, it is brought to the | ||
parquet commitee for ratification. | ||
3. When ratified the extension is moved to an approved state and takes the | ||
reserved 3 byte magic `PAR`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is intended to be a temporary state while the new PAR footer format is being iterated on, maybe we don't need to reserve names like this as "PAR" will be the eventual replacement excluding other names.
It seems to me that this proposal would be better described as a path toward evolving the parquet footer while preserving compatibility for a while. The new footer replacing the existing one eventually.
I think extensibility would be useful as well but those two are not the same thing. By extensibility, I would expect users to have a safe, efficient way to add metadata to the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explain this in the previous comment. Some extensions will never be made public but they are served by the same mechanism as public extensions in transition to become the new standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern with this type of proposal, is the "considered final" piece. I think if there is closed development here, some extent the incentives of getting to a new 'PAR' are potentially misaligned.
I like how we are currently iterating in the open on the proposals so everyone has a chance to validate them and provide input. From a community perspective there is obviously nothing that can be done to stop vendors from having custom metadata in the long run, but at least it would be nice to try to provide guidance for the interim process to encourage openness (maybe the table above could also contain an optional link to a public repo/gist so that the community can keep tabs on the different approaches?)
Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.
ref Parquet Metadata evolution
Jira
Commits
Documentation