-
Notifications
You must be signed in to change notification settings - Fork 915
Add API for Creating Variant Values #7452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
44ddac2
to
c6c570c
Compare
Thank you so much for this @PinkCrow007 -- I have seen it and plan to review it, but may not have a chance for another day or two. So exciting!@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive work! Thanks @PinkCrow007 👍
|
||
/// Encodes an integer value, choosing the smallest sufficient type | ||
pub fn encode_integer(value: i64, output: &mut Vec<u8>) { | ||
if value >= -128 && value <= 127 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if value >= -128 && value <= 127 { | |
if value >= i8::MIN.into() && value <= i8::MAX.into() { |
// Int8 | ||
output.push(primitive_header(VariantPrimitiveType::Int8 as u8)); | ||
output.push(value as u8); | ||
} else if value >= -32768 && value <= 32767 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if value >= -32768 && value <= 32767 { | |
} else if value >= i16::MIN.into() && value <= i16::MAX.into() { |
// Int16 | ||
output.push(primitive_header(VariantPrimitiveType::Int16 as u8)); | ||
output.extend_from_slice(&(value as i16).to_le_bytes()); | ||
} else if value >= -2147483648 && value <= 2147483647 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if value >= -2147483648 && value <= 2147483647 { | |
} else if value >= i32::MIN.into() && value <= i32::MAX.into() { |
/// Timestamp timestamp with time zone 18 TIMESTAMP(isAdjustedToUTC=true, NANOS) 8-byte little-endian | ||
/// TimestampNTZ timestamp without time zone 19 TIMESTAMP(isAdjustedToUTC=false, NANOS) 8-byte little-endian | ||
/// UUID uuid 20 UUID 16-byte big-endian | ||
pub enum VariantPrimitiveType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to add a comment for the specification link here
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
/// Encodes a date value (days since epoch) | ||
pub fn encode_date(value: i32, output: &mut Vec<u8>) { | ||
// Use primitive + date type | ||
let header = primitive_header(VariantPrimitiveType::Date as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a timestamp value (milliseconds since epoch) | ||
pub fn encode_timestamp(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + timestamp type | ||
let header = primitive_header(VariantPrimitiveType::Timestamp as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a timestamp without timezone value (milliseconds since epoch) | ||
pub fn encode_timestamp_ntz(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + timestamp_ntz type | ||
let header = primitive_header(VariantPrimitiveType::TimestampNTZ as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a time without timezone value (milliseconds) | ||
pub fn encode_time_ntz(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + time_ntz type | ||
let header = primitive_header(VariantPrimitiveType::TimeNTZ as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions are quite similar. To reduce duplication, could we create a more general encoder function, like encode_general(type_id: VariantPrimitiveType, value: i64, output: &mut Vec<u8>)
? Or create a encoder trait for VariantPrimitiveType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point
I also think they don't need to be pub
(maybe we could start with pub(crate)
) as the main API people would use is the builder I think
]); | ||
assert_eq!(unscaled_value, large_value); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing tests for timestamps, dates, and uuids
//! Variant | ||
//! | ||
//! <https://arrow.apache.org/docs/format/CanonicalExtensions.html#variant> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link is unavailable because the content related to variant
is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @PinkCrow007 -- this is very very nice work and I think it will form the basis of a great API to create variant values
I had some structural suggestions which I left, but the biggest suggestion is that I think the wonderful tests you have written would be improved significantly if we can change them to read back the Variant values that were written in the builders
Here is what I would like to propose to move forward:
- I'll make a new PR with the scaffolding for a
parquet-variant
crate - I'll port some of the code for reading variant values there
Then I think we could add the builder code and tests you have in this PR and add them to the other crate.
But really this is great work
//! # Example | ||
//! | ||
//! ``` | ||
//! use std::io::Cursor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the Cursor is used
@@ -60,6 +60,8 @@ pub enum ArrowError { | |||
DictionaryKeyOverflowError, | |||
/// Error when the run end index in a REE array is bigger than the array length | |||
RunEndIndexOverflowError, | |||
/// Error during Variant operations in `arrow-variant`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add a new variant to this enum, it will be a "breaking API change" as then downstream projects would potentially have to update their code to handle new variants
We make releases with API changes every three months,
https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule
So in other words, it would be great to remove this change from the PR so we can merge it faster.
@@ -37,6 +37,8 @@ mod uuid; | |||
pub use uuid::Uuid; | |||
mod variable_shape_tensor; | |||
pub use variable_shape_tensor::{VariableShapeTensor, VariableShapeTensorMetadata}; | |||
mod variant; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we postpone adding the canonical extension type classes until we get farther along in the process and are in a better position to write tests.
In other words I recommend removing the changes in arrow-schema/src/extension/ as well in this pR
[package] | ||
name = "arrow-variant" | ||
version = { workspace = true } | ||
description = "JSON to Arrow Variant conversion utilities" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description = "JSON to Arrow Variant conversion utilities" | |
description = "Rust API for reading/writing Apache Parquet Variant values" |
arrow-array = { workspace = true } | ||
arrow-buffer = { workspace = true } | ||
arrow-cast = { workspace = true, optional = true } | ||
arrow-data = { workspace = true } | ||
arrow-schema = { workspace = true, features = ["canonical_extension_types"] } | ||
serde = { version = "1.0", default-features = false } | ||
serde_json = { version = "1.0", default-features = false, features = ["std"] } | ||
indexmap = "2.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any of these dependencies are used so we can remove them
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true, optional = true }
arrow-data = { workspace = true }
serde = { version = "1.0", default-features = false }
serde_json = { version = "1.0", default-features = false, features = ["std"] }
// Verify metadata contains all keys | ||
let keys = get_metadata_keys(&metadata_buffer); | ||
assert_eq!(keys.len(), 11, "Should have 11 keys in metadata"); | ||
assert!(keys.contains(&"null".to_string()), "Missing 'null' key"); | ||
assert!( | ||
keys.contains(&"bool_true".to_string()), | ||
"Missing 'bool_true' key" | ||
); | ||
assert!(keys.contains(&"string".to_string()), "Missing 'string' key"); | ||
|
||
// Verify object has the correct number of entries | ||
// First byte after header is the number of fields (if small object) | ||
assert!(value_buffer.len() > 1, "Value buffer too small"); | ||
let num_fields = value_buffer[1]; | ||
assert_eq!(num_fields as usize, 11, "Object should have 11 fields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than testing these "internal" fields I think the tests would be better if they tested that the resulting value is a readable Variant value. See my next comment below
assert!(!variant.value().is_empty()); | ||
} | ||
|
||
// ========================================================================= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very impressive set of test cases 👌
/// Encodes a date value (days since epoch) | ||
pub fn encode_date(value: i32, output: &mut Vec<u8>) { | ||
// Use primitive + date type | ||
let header = primitive_header(VariantPrimitiveType::Date as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a timestamp value (milliseconds since epoch) | ||
pub fn encode_timestamp(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + timestamp type | ||
let header = primitive_header(VariantPrimitiveType::Timestamp as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a timestamp without timezone value (milliseconds since epoch) | ||
pub fn encode_timestamp_ntz(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + timestamp_ntz type | ||
let header = primitive_header(VariantPrimitiveType::TimestampNTZ as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} | ||
|
||
/// Encodes a time without timezone value (milliseconds) | ||
pub fn encode_time_ntz(value: i64, output: &mut Vec<u8>) { | ||
// Use primitive + time_ntz type | ||
let header = primitive_header(VariantPrimitiveType::TimeNTZ as u8); | ||
output.push(header); | ||
output.extend_from_slice(&value.to_le_bytes()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point
I also think they don't need to be pub
(maybe we could start with pub(crate)
) as the main API people would use is the builder I think
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! [`arrow-variant`] contains utilities for working with the [Arrow Variant][format] binary format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ 📖
/// Builder API for creating variant values | ||
pub mod builder; | ||
/// Encoder module for converting values to Variant binary format | ||
pub mod encoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should start with a minimal API surface area (only expose the Builder and Varaint types directly)
/// Builder API for creating variant values | |
pub mod builder; | |
/// Encoder module for converting values to Variant binary format | |
pub mod encoder; | |
/// Builder API for creating variant values | |
mod builder; | |
/// Encoder module for converting values to Variant binary format | |
mod encoder; |
Thank yoU @Weijun-H for the review as well |
Which issue does this PR close?
Variant: Rust API to Create Variant Values #7424
Rationale for this change
This PR implements a builder-style API for creating Variant values in Rust, following the Variant binary encoding specification. It supports reusing metadata within a single builder session, as discussed in the issue.
What changes are included in this PR?
VariantBuilder
,ObjectBuilder
, andArrayBuilder
for constructing Variant values.Are there any user-facing changes?
This PR adds a new public API for programmatically creating Variant-encoded values.
No breaking changes.
CC: @alamb for visibility