Use Thrift macro to generate Parquet LogicalType serialization code#9997
Use Thrift macro to generate Parquet LogicalType serialization code#9997etseidl wants to merge 13 commits into
LogicalType serialization code#9997Conversation
LogicalType serialization code
|
Depends on #9996, so keeping draft for now |
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (39a4bc1) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark parquet_round_trip |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (50c5168) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (b9ff17a) to 73c513a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| // An earlier version of this crate enforced this when decoding LogicalType. Now that | ||
| // the decoder is macro generated, we do this to preserve the original behavior. | ||
| // TODO: this was done due to a line in the spec saying an unset algorithm defaults | ||
| // to SPHERICAL. But there is no default in the Thrift, so it would be better to set the | ||
| // default when consumed. | ||
| // See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography | ||
| for se in schema.iter_mut() { | ||
| match se.logical_type.as_mut() { | ||
| Some(LogicalType::Geography(g)) if g.algorithm.is_none() => { | ||
| g.algorithm = Some(Default::default()); | ||
| } |
There was a problem hiding this comment.
@paleolimbot I'm wondering if this is still necessary, or if this can be enforced where the logical type is consumed. For instance, the one place I could find where removing this logic would have an impact is when handling the extension type.
LogicalType::Geography(geography) => {
let algorithm = geography.algorithm.map(|a| a.try_as_edges()).transpose()?;
let md = parquet_geospatial::WkbMetadata::new(geography.crs.as_deref(), algorithm);
...
}could instead be
LogicalType::Geography(geography) => {
// if algorithm is None, then set to the default (EdgeInterpolationAlgorithm::SPERICAL).
let algorithm = geography.algorithm.or(Some(Default::default()));
let algorithm = algorithm.map(|a| a.try_as_edges()).transpose()?;
let md = parquet_geospatial::WkbMetadata::new(geography.crs.as_deref(), algorithm);
...
}Would removing this cause a lot of thrash downstream?
Which issue does this PR close?
LogicalTypeenum to macro generated version #9995.Rationale for this change
See issue. Improve code maintainability by using thrift macro to generate
LogicalTypeserialization code.What changes are included in this PR?
Adds a new macro to generate code for a Thrift
unionthat needs to be forward compatible. Does this by adding a catchall_Unknownvariant for unknown field ids.Are there any user-facing changes?
Yes this is a breaking API change because the
LogicalTypeenum will now use tuple variants rather than struct. This also makes public some structs that were previously private.