Map vs Struct type for assets objects #48

Open
kylebarron opened this issue Apr 26, 2024 · 0 comments

After some discussion with @bitner, I'm exploring some helpers to create default schemas for STAC data. The idea is that if we know a STAC collection has a known set of extensions and a known set of assets, then we should be able to assemble a schema that works for any input data, without needing to do runtime schema inference. A second stage of the Parquet conversion could then eliminate columns that are defined in the STAC spec but are fully null in the data at hand.
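
As a rough sketch, that second stage could be as simple as dropping columns that turn out to be entirely null (drop_all_null_columns is a hypothetical helper here, and it only considers top-level columns, not nested struct fields):

import pyarrow as pa

def drop_all_null_columns(table: pa.Table) -> pa.Table:
    # Keep only columns that have at least one non-null value.
    keep = [
        name
        for name, col in zip(table.column_names, table.columns)
        if col.null_count < len(table)
    ]
    return table.select(keep)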

By default, the user would need to provide two pieces of information: the extensions used and the asset keys. But one question here is whether we can simplify this even further by using a Map type instead of a Struct type, likely as an initial step in the Parquet conversion.

Arrow and Parquet both have Map and Struct types. The struct type is akin to a named dictionary, while the map type is akin to a list of ordered (key, value) pairs. The key difference is that the struct type requires knowing the field names up front, while the map type does not.
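
For example, constructing the two types in pyarrow makes the difference clear (the asset names below are made up for illustration):

import pyarrow as pa

# Struct: field names are part of the type and must be declared up front.
struct_type = pa.struct([("visual", pa.string()), ("thumbnail", pa.string())])

# Map: keys are data, so nothing about the asset names needs to be known in advance.
map_type = pa.map_(pa.string(), pa.string())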

So to have a scalable, reliable ingestion process, we could first ingest asset data with a Map type, then quickly infer the asset keys from that data, and then make another copy in the struct column layout.
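
A minimal sketch of that second copy, assuming the assets column is a map<string, ...> and going through Python objects just to show the idea (map_column_to_struct is a hypothetical helper, not a scalable implementation):

import pyarrow as pa

def map_column_to_struct(table: pa.Table, name: str) -> pa.Table:
    col = table.column(name).combine_chunks()
    # Each row of a MapArray converts to a list of (key, value) tuples.
    rows = [dict(kv) if kv is not None else None for kv in col.to_pylist()]
    # The union of keys seen in the data becomes the struct's field names.
    keys = sorted({k for row in rows if row is not None for k in row})
    value_type = col.type.item_type
    struct_type = pa.struct([(k, value_type) for k in keys])
    # Rows missing a key get a null for that field.
    struct_rows = [None if row is None else {k: row.get(k) for k in keys} for row in rows]
    struct_arr = pa.array(struct_rows, type=struct_type)
    return table.set_column(table.column_names.index(name), name, struct_arr)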

Just a note that these are indeed saved differently in Parquet's schema. With a trivial test case:

import pyarrow as pa
import pyarrow.parquet as pq

d = [{"a": 1}, {"a": 2}]

# The same data stored once with an explicit map<string, int8> type
# and once with a struct type inferred from the dicts.
m_arr = pa.array(d, pa.map_(pa.utf8(), pa.int8()))
s_arr = pa.array(d)

m_table = pa.table({"map": m_arr})
s_table = pa.table({"struct": s_arr})
pq.write_table(m_table, "map.parquet")
pq.write_table(s_table, "struct.parquet")

m_meta = pq.read_metadata("map.parquet")
s_meta = pq.read_metadata("struct.parquet")

m_meta.schema
s_meta.schema

You can see the map-based file has a "map" type in the Parquet schema:

<pyarrow._parquet.ParquetSchema object at 0x12f81d640>
required group field_id=-1 schema {
  optional group field_id=-1 map (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional int32 field_id=-1 value (Int(bitWidth=8, isSigned=true));
    }
  }
}

while the other one has a struct type:

<pyarrow._parquet.ParquetSchema object at 0x10a8f9540>
required group field_id=-1 schema {
  optional group field_id=-1 struct {
    optional int64 field_id=-1 a;
  }
}