Skip to content

Add Map support to arrow-avro #7451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jecsand838
Copy link

Which issue does this PR close?

Part of #4886

Related to #6965

Rationale for this change

Avro supports maps as a core data type, but previously arrow-avro lacked any decoding logic to handle them. As a result, any Avro file containing map fields would fail to parse correctly within the Arrow ecosystem. This PR addresses this gap by:

  1. Adding explicit Map decoding: It introduces a new Codec::Map variant and corresponding Decoder::Map logic that reads map blocks in Avro format and constructs an Arrow MapArray. This preserves both the keys and values in a schema-compatible manner, enabling accurate ingestion of map-encoded data.

Overall, these changes expand Arrow’s Avro reader capabilities, allowing users to work with map-encoded data in a standardized Arrow format.

What changes are included in this PR?

This PR adds full Map support in the arrow-avro crate. Key changes include:

1. arrow-avro/src/codec.rs:

  • Added a Codec::Map(Arc<AvroDataType>) variant
  • Logic to construct an Arrow MapArray, including a struct of { key, value } with typed value decoding.
  • Added a new function to AvroDataType.

2. arrow-avro/src/reader/record.rs:

  • Introduced the Map decoding path which leverages blockwise reads of Avro map data.
  • Created helpers read_map_blocks and read_blockwise_items to handle block sequences of key value pairs until a terminating block count of zero.
  • Implemented decoder unit tests for Map types.

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 28, 2025
@@ -155,6 +168,22 @@ impl Codec {
DataType::List(Arc::new(f.field_with_name(Field::LIST_FIELD_DEFAULT_NAME)))
}
Self::Struct(f) => DataType::Struct(f.iter().map(|x| x.field()).collect()),
Self::Map(value_type) => {
let val_dt = value_type.codec.data_type();
let val_field = Field::new("value", val_dt, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the nullability set to value_type.nullability.is_some()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the spec, did not find that if the value for Map can be null or not, but asked Deepseek, it says that the value can be null if the schema supports it(use Unions such as ["null", "string"]), else it can't be null.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klion26 That's a really good call out. I think it does make sense to add that in here.

ComplexType::Map(m) => {
let val = make_data_type(&m.values, namespace, resolver)?;
Ok(AvroDataType {
nullability: None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to set the nullability to val.nullability?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klion26

I was attempting to follow the same pattern used in the other types such as ComplexType::Record(r), ComplexType::Array(a), and ComplexType::Fixed(f).

It seemed from reading the code that nullability would only be set in the Schema::Union(f) branch of make_data_type.

) -> Result<usize, ArrowError> {
let mut total = 0usize;
loop {
let blk = buf.get_long()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems blk is the number of items needed to be read. Is there any case where blk will be negative?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klion26

blk is the block count and there are cases where blk will be negative.

A negative blk is expected by the Avro spec for block-encoded arrays & maps, indicating that the count is -blk items. Avro decoders usually handle this by reading the size marker and then proceeding to decode |blk| entries in that block. After finishing the block, decoding continues with the next block count, until a 0 count terminates the sequence.

Here's the text from the Avro Specification regarding maps with negative block counts:

If a block’s count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, e.g., when projecting a record to a subset of its fields.

}

fn avro_from_codec(codec: Codec) -> AvroDataType {
AvroDataType::new(codec, None, Default::default())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need to test the nullability to Some(x)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klion26 I was thinking it would be best to handle nullability in a dedicated PR after the basic support for each type is merged in.

…ded `Map` comments and improved `Map` nullability handling in `data_type` in codec.rs
@jecsand838
Copy link
Author

@klion26 I pushed changes just now that:

  • Fixed the linter errors
  • Made read_blockwise_items more readable
  • Changed let val_field = Field::new("value", val_dt, true) to let val_field = Field::new("value", val_dt, value_type.nullability.is_some())

Let me know what you think!

@jecsand838 jecsand838 requested a review from klion26 April 30, 2025 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants