Variant Support for Arrow and Parquet [DRAFT] #7404


Draft: wants to merge 21 commits into main

Conversation

PinkCrow007

This is a prototype that explores how we might implement the Variant type in Parquet, including the underlying structures, and as an Arrow extension type. While this work is not yet complete, we (the CMU Variant team) are putting up this PR so everyone has a centralized place to communicate and coordinate work.

Currently, our implementation is set up as follows:

  • In Arrow, we represent Variants using a Canonical Extension Type over binary types (see the sketch after this list).
  • In Parquet, we add Variant as a Logical Type.
  • We also add a top-level arrow-variant crate to centralize parsing logic, similar to arrow-json.
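
For illustration, here is a minimal sketch of that binary-based layout, assuming the canonical extension name arrow.variant and arrow-rs's ARROW:extension:name field-metadata convention; the names are illustrative, not the PR's exact code:

use std::collections::HashMap;
use arrow_schema::{DataType, Field};

// Sketch: a Variant column as a plain Binary field tagged with the
// canonical extension name via field metadata. Each row's blob would
// carry both the variant metadata and the variant value.
fn variant_binary_field(name: &str) -> Field {
    Field::new(name, DataType::Binary, true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.variant".to_string(),
    )]))
}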

Our current goal is to land a minimal Variant implementation, with the intent of adding shredding functionality later. Our current roadmap is as follows:

  • Round-trip Variant between Arrow and Parquet while preserving data.
  • Implement Variant binary encoding/decoding.
  • Round-trip between JSON and Variant.
  • Retrieve Variant data by key.
  • Verify Variant binary compatibility with existing implementations.

I am the main engineer working on this project, with support from @mprammer, @pateljm, and @apavlo. This is our team's first Apache PR. We have been in contact with @alamb and @adriangb in the lead-up to this draft PR, and thank them both for their insight and support.

Which issue does this PR close?

[Parquet] Implement Variant type support in Parquet #6736

@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Apr 11, 2025
@adriangb
Contributor

This is great thank you so much for putting up the PR!

Contributor

@alamb left a comment


First of all, thank you so much @PinkCrow007 (and @mprammer, @pateljm, and @apavlo!)

The quality of the code (and tests and comments!) in this PR far exceeds my own academic code from many 🌔 s ago. Well done

This is our team's first Apache PR.

And quite a nice one it is 👏 . There is a lot of great stuff in this PR. I think it basically lays out the path for how to get a state of the art Variant implementation into arrow-rs (I hope for the "best" Variant implementation)!

I am personally very interested in moving this project along and am willing to devote significant time to helping make it happen.

Next Steps

In order to manage the introduction of a feature of this complexity into a widely used arrow implementation (and, selfishly, to make reviews easier), I recommend we plan to merge the contents of this PR as several smaller individual PRs. Multiple PRs have worked well in the past for this project, and this is a common approach for software development in industrial settings. Having a PR like this one showing how all the parts fit together is still crucial, I think.

Specifically I recommend breaking the PR into the following sequence. If you agree and think this is reasonable, I can file some additional tickets to organize the work.

  1. Implement Variant (a way to read metadata/data fields) -- so fields can be extracted from individual variant values
  2. Implement the variant extension type and arrays (maybe just struct arrays) allowing easy access to each row as a variant
  3. Implement writing / reading variants to parquet (using the extension types from step 2)

Open questions:

Things that are not 100% clear to me yet are:

  1. Where should the JSON string --> Variant conversion code live, and what should the API be? (@scovich is also working on the same thing in [PATHFINDING] Parse json as variant #7403)
  2. How will shredding work? (Specifically, will this be modeled somehow in arrow, or should the parquet reader return the underlying shredded columns as individual arrays and leave it to query engines at a higher level to interpret them as a single logical column?)

/// assert_eq!(retrieved.value(), b"null");
/// ```
#[derive(Clone, Debug)]
pub struct VariantArray {
Contributor

One thing that we may want to explore is to model this as a StructArray with two fields, "metadata" and "value" which appears to be how @neilechao modeled it in the C++ implementation that merged a few days ago
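
A minimal sketch of what that could look like with arrow-rs's existing APIs; field names follow the spec, and the values here are placeholder bytes, not valid variant encodings:

use std::sync::Arc;
use arrow_array::{Array, ArrayRef, BinaryArray, StructArray};
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // One StructArray with two binary children, "metadata" and "value",
    // holding one (metadata, value) pair per row.
    let metadata: ArrayRef = Arc::new(BinaryArray::from(vec![b"m0".as_ref(), b"m1".as_ref()]));
    let value: ArrayRef = Arc::new(BinaryArray::from(vec![b"v0".as_ref(), b"v1".as_ref()]));
    let fields = Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, false),
    ]);
    let variant = StructArray::try_new(fields, vec![metadata, value], None).unwrap();
    assert_eq!(variant.len(), 2);
}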

Author

Thanks so much @alamb! Your brilliant comment and the C++ implementation really got me thinking — maybe I should tackle this from the ground up.

Currently, in my implementation:

  1. In Arrow: Variant is an ExtensionType over Binary
  2. In Parquet: Variant is implemented as a PrimitiveType (not a GroupType)
    • It’s a PrimitiveType where metadata and value are concatenated into one binary blob
    • VariantArray handles the parsing/splitting of this binary data

I initially tried using GroupType but faced issues during arrow_to_parquet conversion. Since the Arrow side is a single Binary ExtensionType, mapping it to a GroupType (which contains two separate binary fields) caused a mismatch in column expectations that was tricky to resolve cleanly.
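
To make the trade-off concrete, here is one hypothetical framing for such a single blob (a length prefix so the two parts can be split apart again); the PR's actual byte layout may differ:

// Hypothetical framing: 4-byte little-endian metadata length, then the
// metadata bytes, then the value bytes.
fn concat(metadata: &[u8], value: &[u8]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(4 + metadata.len() + value.len());
    blob.extend_from_slice(&(metadata.len() as u32).to_le_bytes());
    blob.extend_from_slice(metadata);
    blob.extend_from_slice(value);
    blob
}

// Split the blob back into (metadata, value); panics on malformed input.
fn split(blob: &[u8]) -> (&[u8], &[u8]) {
    let len = u32::from_le_bytes(blob[..4].try_into().unwrap()) as usize;
    blob[4..].split_at(len)
}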

However, your comment and the C++ design raise an important question about alignment and extensibility. Would it be better to:

  1. Make Variant an ExtensionType on top of Struct type in Arrow
  2. Use GroupType containing two binary fields in Parquet

This aligns with the C++ version and avoids the conversion mismatch. What do you think — does that sound like a better design?
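
A sketch of the Arrow side of that proposal, assuming the canonical extension name arrow.variant (the Parquet side would be a two-field group annotated with the VARIANT logical type):

use std::collections::HashMap;
use arrow_schema::{DataType, Field, Fields};

// Sketch: a Variant column as a Struct of two required binary children,
// tagged with the canonical extension name via field metadata.
fn variant_struct_field(name: &str) -> Field {
    let children = Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, false),
    ]);
    Field::new(name, DataType::Struct(children), true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.variant".to_string(),
    )]))
}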

Contributor

Make Variant an ExtensionType on top of Struct type in Arrow
Use GroupType containing two binary fields in Parquet
This aligns with the C++ version and avoids the conversion mismatch. What do you think — does that sound like a better design?

Yes I think this sounds right. The closer we can keep the arrow implementation to the spec and the other implementations the easier it will be to use in my opinion. If there is good reason to deviate we can discuss that too but in general I think following the existing implementations will make things easier

In terms of GroupType and encodings in Parquet, you may need to re-generate the thrift definitions to pick up the most recent version of the spec: https://github.com/apache/parquet-format/tree/master

We would do that by updating the revision here and then rerunning regen.sh:

REVISION=5b564f3c47679526cf72e54f207013f28f53acc4

Contributor

Just to chime in here, I think this is the correct way to go as well. It allows readers to read the metadata without necessarily reading the payload

Author

REVISION=5b564f3c47679526cf72e54f207013f28f53acc4

Thanks @alamb ! Currently, the main branch of arrow-rs is using the same revision as I did, which doesn’t include the VARIANT logical type yet. I looked into updating to the latest revision from parquet-format, but it introduces new types like GEOMETRY, GEOGRAPHY, and other changes that are not yet supported in the arrow-rs main branch.
So if I stick with the current revision, I’ll need to manually add VARIANT to format.rs. If I upgrade to the latest parquet-format, I’ll also need to patch more places. Let me know if you think there’s a cleaner path forward!

Contributor

I looked into updating to the latest revision from parquet-format, but it introduces new types like GEOMETRY, GEOGRAPHY, and other changes that are not yet supported in the arrow-rs main branch.

I think it is fine to add the type definitions for those new types.

@etseidl actually has a PR where he updated the thrift definition here: https://github.com/apache/arrow-rs/pull/7408/files#diff-d4b6c8a220629cfe5ac02a85cd537547b2e5c77b01e0793ced98adeb809bcde4R20

Though it is not clear what the status of that PR is.

I will try and find time to write up some tickets with some smaller subtasks tomorrow

Contributor

That PR is a proof of concept to help get a change to the Parquet spec across the finish line. The format changes there, while including the Variant additions, also have not-yet-released bits. It's probably not worth waiting on those changes; instead, regen off of the last released version of the Parquet format (2.11.0), which appears to be commit 848302e179d7bb52a64caea6a058b3c08212787c.

@@ -4431,4 +4431,258 @@ mod tests {
assert_eq!(c0.len(), c1.len());
c0.iter().zip(c1.iter()).for_each(|(l, r)| assert_eq!(l, r));
}

#[test]
#[cfg(feature = "arrow_canonical_extension_types")]
Contributor

one of the benefits of using the existing StructArray mechanism rather than a new Array type is that we can likely reuse most of the existing parquet writing code, focusing on ensuring the metadata is written/read accurately

Contributor

+1 for re-using struct array machinery.

It should also make it easier to do granular projection pushdown (only read data for a single field)
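
For example, a hedged sketch of that kind of pushdown using the parquet crate's existing projection machinery; the file name and leaf index are illustrative:

use std::fs::File;
use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ProjectionMask};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With variant stored as a group of two binary leaves, the existing
    // ProjectionMask can select just one of them, e.g. only "metadata".
    let file = File::open("variant.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let mut reader = builder.with_projection(mask).build()?;
    while let Some(batch) = reader.next().transpose()? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}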

}

fn serialize_metadata(&self) -> Option<String> {
    Some(STANDARD.encode(&self.metadata)) // Encode metadata as a base64 string
Contributor

My understanding is that the extension metadata for a variant column isn't the same as the metadata for each variant value -- I probably don't fully understand the code, but I think the extension type metadata is stored once per field, whereas the variant's metadata is stored for each row (basically, the word metadata is overloaded 😢)

Author

Yeah, you're right. The ExtensionType metadata and the Variant metadata are not related. Right now the ExtensionType metadata isn’t storing anything meaningful.

/// <https://arrow.apache.org/docs/format/CanonicalExtensions.html#variant>
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    metadata: Vec<u8>, // Required binary metadata
Contributor

One concern I have with this approach is that it will require two allocations (and memory copies) to read a Variant value

I am hoping we can use Rust's borrow checker to do this with pointers (well, slices in rust) and no copying -- something like

pub struct Variant<'a> {
    metadata: &'a [u8], // Required binary metadata
    value: &'a [u8],
}
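
A sketch of how that borrowed form could be wired up against the two binary child arrays; the names are illustrative, not the PR's API:

use arrow_array::BinaryArray;

// Sketch: borrow both buffers from the underlying arrays, so extracting
// a row allocates and copies nothing.
pub struct Variant<'a> {
    metadata: &'a [u8],
    value: &'a [u8],
}

impl<'a> Variant<'a> {
    // `metadata_array` and `value_array` are the two binary children of
    // the variant StructArray.
    pub fn from_row(
        metadata_array: &'a BinaryArray,
        value_array: &'a BinaryArray,
        row: usize,
    ) -> Self {
        Self {
            metadata: metadata_array.value(row),
            value: value_array.value(row),
        }
    }
}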

@alamb
Contributor

alamb commented Apr 16, 2025

Here are my planned next steps:

  1. I will start writing up some tickets
  2. I will work on getting example binary data for variant (e.g. Add example Variant data and parquet files parquet-testing#75)

@aihuaxu

aihuaxu commented Apr 18, 2025

In Parquet, we add Variant as a Logical Type.

FYI: the Variant logical type has been added to Parquet Java (see apache/parquet-java#3072). The annotation will be VARIANT(<specVersion>), as mentioned in the last Parquet sync-up. cc @alamb @PinkCrow007 in case we are adding the logical type in this PR.

@alamb
Contributor

alamb commented Apr 18, 2025

FYI I have started breaking the work down into smaller tickets -- here is the list: #6736 (comment)

@PinkCrow007
Author

Thanks everyone for the helpful suggestions! Just a quick update: I've adjusted the design to use ExtensionType over Struct in Arrow and GroupType in Parquet. Parquet read/write, binary encoding, and JSON conversion all continue to work as expected after the change.
Hopefully this can serve as a reference going forward.
I’ll follow the smaller tickets Andrew outlined to continue the work. Thanks again everyone!
