Variant Support for Arrow and Parquet [DRAFT] #7404


Draft: wants to merge 21 commits into main

Conversation

PinkCrow007

This is a prototype that explores how we might implement the Variant type in Parquet, including the underlying structures, and as an Arrow extension type. While this work is not yet complete, we (the CMU Variant team) are putting up this PR so everyone has a centralized place to communicate and coordinate work.

Currently, our implementation is set up as follows:

  • In Arrow, we represent Variants using a Canonical Extension Type over binary types (see the sketch after this list).
  • In Parquet, we add Variant as a Logical Type.
  • We also add a top-level arrow-variant crate to centralize parsing logic, similar to arrow-json.
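
For illustration, here is a minimal sketch of that binary-based layout, assuming the canonical extension name arrow.variant and arrow-rs's ARROW:extension:name field-metadata convention; the names are illustrative, not the PR's exact code:

use std::collections::HashMap;
use arrow_schema::{DataType, Field};

// Sketch: a Variant column as a plain Binary field tagged with the
// canonical extension name via field metadata. Each row's blob would
// carry both the variant metadata and the variant value.
fn variant_binary_field(name: &str) -> Field {
    Field::new(name, DataType::Binary, true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.variant".to_string(),
    )]))
}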

Our current goal is to land a minimal Variant implementation, with the intent of adding shredding functionality later. Our current roadmap is as follows:

  • Round-trip Variant between Arrow and Parquet while preserving data.
  • Implement Variant binary encoding/decoding.
  • Round-trip between JSON and Variant.
  • Retrieve Variant data by key.
  • Verify Variant binary compatibility with existing implementations.

I am the main engineer working on this project, with support from @mprammer, @pateljm, and @apavlo. This is our team's first Apache PR. We have been in contact with @alamb and @adriangb in the lead-up to this draft PR, and thank them both for their insight and support.

Which issue does this PR close?

[Parquet] Implement Variant type support in Parquet #6736

@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Apr 11, 2025
@adriangb
Contributor

This is great thank you so much for putting up the PR!

Contributor

@alamb left a comment


First of all, thank you so much @PinkCrow007 (and @mprammer, @pateljm, and @apavlo!)

The quality of the code (and tests and comments!) in this PR far exceeds my own academic code from many 🌔 s ago. Well done

This is our team's first Apache PR.

And quite a nice one it is 👏 . There is a lot of great stuff in this PR. I think it basically lays out the path for how to get a state of the art Variant implementation into arrow-rs (I hope for the "best" Variant implementation)!

I am personally very interested in moving this project along and am willing to devote significant time to helping make it happen.

Next Steps

In order to manage the introduction of a feature of this complexity into a widely used arrow implementation (and, selfishly, to make reviews easier), I recommend we plan to merge the contents of this PR as several smaller individual PRs. Multiple PRs have worked well in the past for this project, and this is a common approach for software development in industrial settings. Having a PR like this one showing how all the parts fit together is still crucial, I think.

Specifically I recommend breaking the PR into the following sequence. If you agree and think this is reasonable, I can file some additional tickets to organize the work.

  1. Implement Variant (a way to read metadata/data fields) -- so fields can be extracted from individual variant values
  2. Implement the variant extension type and arrays (maybe just struct arrays) allowing easy access to each row as a variant
  3. Implement writing / reading variants to parquet (using the extension types from step 2)

Open questions:

Things that are not 100% clear to me yet are:

  1. Where should the JSON string --> Variant conversion code live, and what should the API be? (@scovich is also working on the same thing in [PATHFINDING] Parse json as variant #7403)
  2. How will shredding work? (Specifically, will this be modeled somehow in arrow, or should the parquet reader return the underlying shredded columns as individual arrays and leave it to query engines at a higher level to interpret them as a single logical column?)

/// assert_eq!(retrieved.value(), b"null");
/// ```
#[derive(Clone, Debug)]
pub struct VariantArray {
Contributor

One thing that we may want to explore is to model this as a StructArray with two fields, "metadata" and "value" which appears to be how @neilechao modeled it in the C++ implementation that merged a few days ago
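
A minimal sketch of what that could look like with arrow-rs's existing APIs; field names follow the spec, and the values here are placeholder bytes, not valid variant encodings:

use std::sync::Arc;
use arrow_array::{Array, ArrayRef, BinaryArray, StructArray};
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // One StructArray with two binary children, "metadata" and "value",
    // holding one (metadata, value) pair per row.
    let metadata: ArrayRef = Arc::new(BinaryArray::from(vec![b"m0".as_ref(), b"m1".as_ref()]));
    let value: ArrayRef = Arc::new(BinaryArray::from(vec![b"v0".as_ref(), b"v1".as_ref()]));
    let fields = Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, false),
    ]);
    let variant = StructArray::try_new(fields, vec![metadata, value], None).unwrap();
    assert_eq!(variant.len(), 2);
}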

Author

Thanks so much @alamb! Your brilliant comment and the C++ implementation really got me thinking — maybe I should tackle this from the ground up.

Currently, in my implementation:

  1. In Arrow: Variant is an ExtensionType over Binary
  2. In Parquet: Variant is implemented as a PrimitiveType (not a GroupType)
    • It’s a PrimitiveType where metadata and value are concatenated into one binary blob
    • VariantArray handles the parsing/splitting of this binary data

I initially tried using GroupType but faced issues during arrow_to_parquet conversion. Since the Arrow side is a single Binary ExtensionType, mapping it to a GroupType (which contains two separate binary fields) caused a mismatch in column expectations that was tricky to resolve cleanly.
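
To make the trade-off concrete, here is one hypothetical framing for such a single blob (a length prefix so the two parts can be split apart again); the PR's actual byte layout may differ:

// Hypothetical framing: 4-byte little-endian metadata length, then the
// metadata bytes, then the value bytes.
fn concat(metadata: &[u8], value: &[u8]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(4 + metadata.len() + value.len());
    blob.extend_from_slice(&(metadata.len() as u32).to_le_bytes());
    blob.extend_from_slice(metadata);
    blob.extend_from_slice(value);
    blob
}

// Split the blob back into (metadata, value); panics on malformed input.
fn split(blob: &[u8]) -> (&[u8], &[u8]) {
    let len = u32::from_le_bytes(blob[..4].try_into().unwrap()) as usize;
    blob[4..].split_at(len)
}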

However, your comment and the C++ design raise an important question about alignment and extensibility. Would it be better to:

  1. Make Variant an ExtensionType on top of Struct type in Arrow
  2. Use GroupType containing two binary fields in Parquet

This aligns with the C++ version and avoids the conversion mismatch. What do you think — does that sound like a better design?
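
A sketch of the Arrow side of that proposal, assuming the canonical extension name arrow.variant (the Parquet side would be a two-field group annotated with the VARIANT logical type):

use std::collections::HashMap;
use arrow_schema::{DataType, Field, Fields};

// Sketch: a Variant column as a Struct of two required binary children,
// tagged with the canonical extension name via field metadata.
fn variant_struct_field(name: &str) -> Field {
    let children = Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, false),
    ]);
    Field::new(name, DataType::Struct(children), true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.variant".to_string(),
    )]))
}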

Contributor

Make Variant an ExtensionType on top of Struct type in Arrow
Use GroupType containing two binary fields in Parquet
This aligns with the C++ version and avoids the conversion mismatch. What do you think — does that sound like a better design?

Yes I think this sounds right. The closer we can keep the arrow implementation to the spec and the other implementations the easier it will be to use in my opinion. If there is good reason to deviate we can discuss that too but in general I think following the existing implementations will make things easier

In terms of GroupType and encodings in Parquet, you may need to re-generate the thrift definitions to pick up the most recent version of the spec: https://github.com/apache/parquet-format/tree/master

We would do that by updating the revision here and then rerunning regen.sh:

REVISION=5b564f3c47679526cf72e54f207013f28f53acc4

Contributor

Just to chime in here, I think this is the correct way to go as well. It allows readers to read the metadata without necessarily reading the payload

Author

REVISION=5b564f3c47679526cf72e54f207013f28f53acc4

Thanks @alamb ! Currently, the main branch of arrow-rs is using the same revision as I did, which doesn’t include the VARIANT logical type yet. I looked into updating to the latest revision from parquet-format, but it introduces new types like GEOMETRY, GEOGRAPHY, and other changes that are not yet supported in the arrow-rs main branch.
So if I stick with the current revision, I’ll need to manually add VARIANT to format.rs. If I upgrade to the latest parquet-format, I’ll also need to patch more places. Let me know if you think there’s a cleaner path forward!

Contributor

I looked into updating to the latest revision from parquet-format, but it introduces new types like GEOMETRY, GEOGRAPHY, and other changes that are not yet supported in the arrow-rs main branch.

I think it is fine to add the type definitions for those new types.

@etseidl actually has a PR where he updated the thrift definition here: https://github.com/apache/arrow-rs/pull/7408/files#diff-d4b6c8a220629cfe5ac02a85cd537547b2e5c77b01e0793ced98adeb809bcde4R20

Though it is not clear what the status of that PR is.

I will try and find time to write up some tickets with some smaller subtasks tomorrow

Contributor

That PR is a proof of concept to help get a change to the Parquet spec across the finish line. The format changes there, while including the Variant additions, also have not-yet-released bits. It's probably not worth waiting on those changes; instead, regen off of the last released version of the Parquet format (2.11.0), which appears to be commit 848302e179d7bb52a64caea6a058b3c08212787c.

@@ -4431,4 +4431,258 @@ mod tests {
assert_eq!(c0.len(), c1.len());
c0.iter().zip(c1.iter()).for_each(|(l, r)| assert_eq!(l, r));
}

#[test]
#[cfg(feature = "arrow_canonical_extension_types")]
Contributor

one of the benefits of using the existing StructArray mechanism rather than a new Array type is that we can likely reuse most of the existing parquet writing code, focusing on ensuring the metadata is written/read accurately

Contributor

+1 for re-using struct array machinery.

It should also make it easier to do granular projection pushdown (only read data for a single field)
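
For example, a hedged sketch of that kind of pushdown using the parquet crate's existing projection machinery; the file name and leaf index are illustrative:

use std::fs::File;
use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ProjectionMask};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With variant stored as a group of two binary leaves, the existing
    // ProjectionMask can select just one of them, e.g. only "metadata".
    let file = File::open("variant.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let mut reader = builder.with_projection(mask).build()?;
    while let Some(batch) = reader.next().transpose()? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}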

}

fn serialize_metadata(&self) -> Option<String> {
    Some(STANDARD.encode(&self.metadata)) // Encode metadata as a base64 string
Contributor

My understanding is that the extension metadata for a variant column isn't the same as the metadata for each variant value -- I probably don't fully understand the code, but I think the extension type metadata is stored once per field, whereas the variant's metadata is stored for each row (basically, the word metadata is overloaded 😢)

Author

Yeah, you're right. The ExtensionType metadata and the Variant metadata are not related. Right now the ExtensionType metadata isn’t storing anything meaningful.

/// <https://arrow.apache.org/docs/format/CanonicalExtensions.html#variant>
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    metadata: Vec<u8>, // Required binary metadata
Contributor

One concern I have with this approach is that it will require two allocations (and memory copies) to read a Variant value

I am hoping we can use Rust's borrow checker to do this with pointers (well, slices in rust) and no copying -- something like

pub struct Variant<'a> {
    metadata: &'a [u8], // Required binary metadata
    value: &'a [u8],
}
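
A sketch of how that borrowed form could be wired up against the two binary child arrays; the names are illustrative, not the PR's API:

use arrow_array::BinaryArray;

// Sketch: borrow both buffers from the underlying arrays, so extracting
// a row allocates and copies nothing.
pub struct Variant<'a> {
    metadata: &'a [u8],
    value: &'a [u8],
}

impl<'a> Variant<'a> {
    // `metadata_array` and `value_array` are the two binary children of
    // the variant StructArray.
    pub fn from_row(
        metadata_array: &'a BinaryArray,
        value_array: &'a BinaryArray,
        row: usize,
    ) -> Self {
        Self {
            metadata: metadata_array.value(row),
            value: value_array.value(row),
        }
    }
}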

@alamb
Contributor

alamb commented Apr 16, 2025

Here are my planned next steps:

  1. I will start writing up some tickets
  2. I will work on getting example binary data for variant (e.g. Add example Variant data and parquet files parquet-testing#75)

@aihuaxu

aihuaxu commented Apr 18, 2025

In Parquet, we add Variant as a Logical Type.

FYI: the Variant logical type has been added to Parquet Java (see apache/parquet-java#3072). The annotation will be VARIANT(<specVersion>), as mentioned in the last Parquet sync-up. cc @alamb @PinkCrow007 in case we are adding the logical type in this PR.

@alamb
Contributor

alamb commented Apr 18, 2025

FYI I have started breaking the work down into smaller tickets -- here is the list: #6736 (comment)

@PinkCrow007
Author

Thanks everyone for the helpful suggestions! Just a quick update: I've adjusted the design to use ExtensionType over Struct in Arrow and GroupType in Parquet. Parquet read/write, binary encoding, and JSON conversion all continue to work as expected after the change.
Hopefully this can serve as a reference going forward.
I’ll follow the smaller tickets Andrew outlined to continue the work. Thanks again everyone!
