-
Notifications
You must be signed in to change notification settings - Fork 915
Add Map support to arrow-avro #7451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
arrow-avro/src/codec.rs
Outdated
@@ -155,6 +168,22 @@ impl Codec { | |||
DataType::List(Arc::new(f.field_with_name(Field::LIST_FIELD_DEFAULT_NAME))) | |||
} | |||
Self::Struct(f) => DataType::Struct(f.iter().map(|x| x.field()).collect()), | |||
Self::Map(value_type) => { | |||
let val_dt = value_type.codec.data_type(); | |||
let val_field = Field::new("value", val_dt, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the nullability
set to value_type.nullability.is_some()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read the spec, did not find that if the value for Map
can be null or not, but asked Deepseek, it says that the value can be null if the schema supports it(use Unions
such as ["null", "string"]
), else it can't be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klion26 That's a really good call out. I think it does make sense to add that in here.
ComplexType::Map(m) => { | ||
let val = make_data_type(&m.values, namespace, resolver)?; | ||
Ok(AvroDataType { | ||
nullability: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to set the nullability
to val.nullability
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was attempting to follow the same pattern used in the other types such as ComplexType::Record(r)
, ComplexType::Array(a)
, and ComplexType::Fixed(f)
.
It seemed from reading the code that nullability would only be set in the Schema::Union(f)
branch of make_data_type
.
arrow-avro/src/reader/record.rs
Outdated
) -> Result<usize, ArrowError> { | ||
let mut total = 0usize; | ||
loop { | ||
let blk = buf.get_long()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems blk
is the number of items needed to be read. Is there any case where blk
will be negative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blk
is the block count and there are cases where blk
will be negative.
A negative blk
is expected by the Avro spec for block-encoded arrays & maps, indicating that the count is -blk
items. Avro decoders usually handle this by reading the size marker and then proceeding to decode |blk|
entries in that block. After finishing the block, decoding continues with the next block count, until a 0 count terminates the sequence.
Here's the text from the Avro Specification regarding maps with negative block counts:
If a block’s count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, e.g., when projecting a record to a subset of its fields.
arrow-avro/src/reader/record.rs
Outdated
} | ||
|
||
fn avro_from_codec(codec: Codec) -> AvroDataType { | ||
AvroDataType::new(codec, None, Default::default()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need to test the nullability
to Some(x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klion26 I was thinking it would be best to handle nullability in a dedicated PR after the basic support for each type is merged in.
…ded `Map` comments and improved `Map` nullability handling in `data_type` in codec.rs
@klion26 I pushed changes just now that:
Let me know what you think! |
Which issue does this PR close?
Part of #4886
Related to #6965
Rationale for this change
Avro supports maps as a core data type, but previously arrow-avro lacked any decoding logic to handle them. As a result, any Avro file containing map fields would fail to parse correctly within the Arrow ecosystem. This PR addresses this gap by:
Codec::Map
variant and correspondingDecoder::Map
logic that reads map blocks in Avro format and constructs an Arrow MapArray. This preserves both the keys and values in a schema-compatible manner, enabling accurate ingestion of map-encoded data.Overall, these changes expand Arrow’s Avro reader capabilities, allowing users to work with map-encoded data in a standardized Arrow format.
What changes are included in this PR?
This PR adds full Map support in the arrow-avro crate. Key changes include:
1. arrow-avro/src/codec.rs:
Codec::Map(Arc<AvroDataType>)
variantAvroDataType
.2. arrow-avro/src/reader/record.rs:
read_blockwise_items
to handle block sequences of key value pairs until a terminating block count of zero.Are there any user-facing changes?
N/A