After some discussion with @bitner I'm exploring some helpers to create default schemas for STAC data. The idea being that if we know a STAC collection has a known set of extensions and a known set of assets, then we should be able to assemble a schema that works for any input data, without needing to do runtime schema inference. Then a second stage of the Parquet conversion could eliminate columns that are defined in the STAC spec but are fully null in the data at hand.
By default, the pieces of information the user would need to provide are the extensions used and the asset keys. But one question here is whether we can simplify this even further by using a Map type instead of a Struct type, likely as an initial step in the Parquet conversion.
Arrow and Parquet both have Map and Struct types. The struct type is akin to a named dictionary, while the map type is akin to a list of ordered (key, value) pairs. The key difference is that you need to know the field names up front for a struct type, but you don't for a map type.
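For illustration, here's how the two types might look for an `assets` column in pyarrow (the `href`/`type` fields and the asset names below are hypothetical):

```python
import pyarrow as pa

# A simplified asset value; real STAC assets carry more fields.
asset_value = pa.struct([("href", pa.string()), ("type", pa.string())])

# Struct: every asset key must be declared in the schema up front.
struct_type = pa.struct([("thumbnail", asset_value), ("data", asset_value)])

# Map: keys live in the data, so one type covers any set of asset names.
map_type = pa.map_(pa.string(), asset_value)

print(struct_type)  # struct<thumbnail: ..., data: ...>
print(map_type)     # map<string, struct<href: string, type: string>>
```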
So to have a scalable, reliable ingestion process, we could first ingest asset data with a Map type, then quickly infer the keys for the asset names, then make another copy into the struct column layout.
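A rough pyarrow sketch of that second step, assuming all map values share a single type (the function name and approach are mine, not an existing API):

```python
import pyarrow as pa

def map_column_to_struct(table: pa.Table, column: str) -> pa.Table:
    """Infer the distinct keys of a map column and re-express it as a struct."""
    col = table[column].combine_chunks()
    # The distinct keys across all rows become the struct's field names.
    names = sorted(set(col.keys.to_pylist()))
    # Rows come back as lists of (key, value) tuples; missing keys become null.
    rows = col.to_pylist()
    children = [
        pa.array([dict(row or []).get(name) for row in rows],
                 type=col.type.item_type)
        for name in names
    ]
    struct_col = pa.StructArray.from_arrays(children, names=names)
    idx = table.schema.get_field_index(column)
    return table.set_column(idx, column, struct_col)
```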
Just a note that these are indeed saved differently in Parquet's schema. With a trivial test case:
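Something along these lines reproduces it (the file names and the single `thumbnail` asset are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

value = pa.struct([("href", pa.string())])

# The same one-row "assets" column typed as a map and as a struct.
map_table = pa.table({
    "assets": pa.array([[("thumbnail", {"href": "a.png"})]],
                       type=pa.map_(pa.string(), value))
})
struct_table = pa.table({
    "assets": pa.array([{"thumbnail": {"href": "a.png"}}],
                       type=pa.struct([("thumbnail", value)]))
})

pq.write_table(map_table, "map.parquet")
pq.write_table(struct_table, "struct.parquet")
print(pq.ParquetFile("map.parquet").schema)
print(pq.ParquetFile("struct.parquet").schema)
```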
You can see the map-based file has a "map" type in the Parquet schema:
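Per the Parquet spec's map encoding, it comes out roughly like this (exact formatting varies by tool and version, and the `href` field comes from the hypothetical example above):

```
optional group assets (MAP) {
  repeated group key_value {
    required binary key (STRING);
    optional group value {
      optional binary href (STRING);
    }
  }
}
```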
while the other one has a `struct` type, with the asset names baked into nested groups (same formatting caveats as above):
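```
optional group assets {
  optional group thumbnail {
    optional binary href (STRING);
  }
}
```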