Rework AvroCompat for ParquetType #766
Comments
Have some new thoughts on this. Reading the Parquet spec more closely, a non-grouped repeated field defaults to a required list. We've been relying on this default behavior in Magnolify ParquetType, but we probably should have always added a wrapper group to be explicit.

I filed and merged a fix for the underlying incompatibility in parquet-mr's AvroSchemaConverter, PARQUET-2425, since technically a non-grouped repeated field should be supported 😅

Once the fix is released, IMO we should start moving away from AvroCompat in magnolify-parquet. Ideally we could just modify […]

(1) Update all the reader code to treat non-grouped repeated field schemas as equivalent to required repeated field schemas at read time. This would mean updating Schema.checkCompatibility as well as wrapping repeated schemas into required groups if an Avro list is detected, here […] (We could just use […])

(2) Eventually update the writer code to produce wrapped repeated schemas, maybe with a fallback option via […]

Thoughts?
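To make step (1) concrete, here is a rough sketch of the read-side normalization, written against a deliberately simplified schema model. Everything in it (`Field`, `Primitive`, `Group`, `ReadSideNormalizer`) is hypothetical and is not Magnolify's or Parquet's actual API; the only detail borrowed from Parquet-Avro is the `array` name for the repeated child.

```scala
// Hypothetical, simplified schema model; these are NOT Parquet or Magnolify types.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

sealed trait Field { def name: String; def repetition: Repetition }
final case class Primitive(name: String, repetition: Repetition, tpe: String) extends Field
final case class Group(name: String, repetition: Repetition, isList: Boolean, fields: List[Field])
    extends Field

object ReadSideNormalizer {
  // Wrap a bare repeated field in a required LIST group. The wrapper keeps the
  // field's name; the repeated child is renamed "array" (Parquet-Avro's legacy name).
  private def wrap(field: Field): Group = {
    val child = field match {
      case p: Primitive => p.copy(name = "array")
      case g: Group     => g.copy(name = "array")
    }
    Group(field.name, Required, isList = true, fields = List(child))
  }

  // insideList is true only for the direct child of a LIST group: that child is
  // already the list's repeated layer and must not be wrapped a second time.
  def normalize(field: Field, insideList: Boolean = false): Field = {
    val rebuilt = field match {
      case g: Group     => g.copy(fields = g.fields.map(normalize(_, insideList = g.isList)))
      case p: Primitive => p
    }
    if (rebuilt.repetition == Repeated && !insideList) wrap(rebuilt) else rebuilt
  }

  // Two schemas are treated as compatible if they agree after normalization.
  def compatible(a: Field, b: Field): Boolean = normalize(a) == normalize(b)
}
```

With this shape, a reader could normalize both the file schema and the requested schema before comparing them, so consumers would not need to know which layout the producer used.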
Additionally on the write side, once we've upgraded to Parquet 1.14.2 and the array compatibility is no longer an issue, I don't see any reason not to write the […]
Background:
Parquet doesn’t have a single canonical in-memory representation like Avro does; it’s a file format whose read/write layer allows the user to select the specific data format they’d like to read Parquet records into. Parquet-Avro, which belongs to the OSS Parquet library, is one of the most popular. Magnolify-Parquet provides an alternative data format: Scala case classes. Theoretically, data formats can be mixed and matched; you can write using parquet-avro and read into Scala case classes, or vice versa.
The exception to data format interchangeability is when a schema contains a repeated field. Parquet’s MessageType natively supports marking any primitive or complex field as repeated, for example:
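As an illustration only, a non-grouped repeated field might look like the schema below. The record name `Record` and field name `values` are assumptions made for this sketch; `MessageTypeParser` comes from parquet-column.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser, Type}

object BareRepeatedExample {
  // Hypothetical schema: `Record` and `values` are illustrative names.
  // The repeated primitive sits directly in the message, with no wrapper group.
  val schema: MessageType = MessageTypeParser.parseMessageType(
    """message Record {
      |  repeated int32 values;
      |}""".stripMargin
  )

  // The field itself carries the REPEATED repetition and is a plain primitive.
  val values: Type = schema.getType("values")
}
```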
By default, this is how Magnolify-Parquet generates schemas for repeated types. However, this protocol for repeated fields is incompatible with the Parquet-Avro format:
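For contrast, Parquet-Avro's default (legacy two-level) list layout wraps the repeated field in a LIST-annotated group and names the repeated child `array`. A sketch of that shape, again with the hypothetical names `Record` and `values`:

```scala
import org.apache.parquet.schema.{GroupType, LogicalTypeAnnotation, MessageType, MessageTypeParser, Type}

object AvroStyleListExample {
  // Hypothetical schema: `Record` and `values` are illustrative names.
  // The list is a required group annotated with LIST; the repeated field
  // lives inside it under the name `array`.
  val schema: MessageType = MessageTypeParser.parseMessageType(
    """message Record {
      |  required group values (LIST) {
      |    repeated int32 array;
      |  }
      |}""".stripMargin
  )

  val wrapper: GroupType = schema.getType("values").asGroupType()
  val element: Type = wrapper.getType("array")
}
```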
As a result, Parquet records containing a repeated field that were written with Parquet-Avro could not be read using Magnolify-Parquet until we introduced AvroCompat to Magnolify. This is a Scala object that, when imported, uses an encoding trick to produce repeated schemas compatible with Parquet-Avro:
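A minimal sketch of opting in, assuming the import path `magnolify.parquet.ParquetArray.AvroCompat._` as given in the Magnolify documentation (it may differ between versions). The `Record` case class is hypothetical.

```scala
import magnolify.parquet._
// Opt in to Avro-compatible arrays; import path assumed from the Magnolify docs.
import magnolify.parquet.ParquetArray.AvroCompat._
import org.apache.parquet.schema.MessageType

// Hypothetical record with a single repeated field.
final case class Record(values: List[Int])

object AvroCompatSketch {
  // With the import in scope, the derived schema should wrap `values`
  // in a group instead of emitting a bare repeated primitive.
  val schema: MessageType = ParquetType[Record].schema

  def main(args: Array[String]): Unit = println(schema)
}
```

Note that the import changes the generated schema at the derivation site, so both producer and consumer code have to remember to include it.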
Problem Statement:
This works, but has some downsides that have become more apparent with increased adoption, namely:
IMO, the Avro compatibility design could be reworked in a way that (a) is easy and intuitive for producers to enable, and (b) doesn’t require the data consumer to know how the upstream data was produced. We could also make Avro-compatible output the default way that Magnolify-Parquet writes repeated fields, so that it’s opt-out rather than opt-in, although that’s a potentially big breaking change for users.