From bulk query using GeoParquet to StacCollection #200

ivanhigueram · 2023-03-17T03:09:42Z

Hello all,

I am following the bulk query feature in the tutorial. Now that I have subset the metadata and selected the items I need, I am trying to transform the assets column dictionary to a pystac.ItemCollection. I have tried to create the pystac.Item manually, but I haven't been able to build the object correctly. Also, the href seems to be unsigned, so I cannot use the rest of the pipeline used in the tutorials.

Is there a way to request the images using the subset of the metadata, or should I use the metadata to run individual queries again?

Thanks for your help!

TomAugspurger · 2023-03-19T15:31:05Z

> I am trying to transform the assets column dictionary to a pystac.ItemCollection

Good question! It's not well documented (mostly because it's not thoroughly tested), but stac-geoparquet does have a pair of functions to help with this. In theory, stac_geoparquet.to_item_collection will do what you need. In practice, we've had some challenges with faithfully transforming some nested objects (which might be cast to NumPy arrays for pandas) back to STAC-appropriate types.

But even trying it on the initial example from the quickstart fails:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [12], line 3
      1 import stac_geoparquet
----> 3 ic = stac_geoparquet.stac_geoparquet.to_item_collection(df)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/stac_geoparquet/stac_geoparquet.py:119, in to_item_collection(df)
    114 for k in datelike:
    115     df2[k] = (
    116         df2[k].dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ").fillna("").replace({"": None})
    117     )
--> 119 return pystac.ItemCollection(
    120     [to_dict(record) for record in df2.to_dict(orient="records")]
    121 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pystac/item_collection.py:95, in ItemCollection.__init__(self, items, extra_fields, clone_items)
     92     else:
     93         return pystac.Item.from_dict(item_or_dict, preserve_dict=clone_items)
---> 95 self.items = list(map(map_item, items))
     96 self.extra_fields = extra_fields or {}

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pystac/item_collection.py:93, in ItemCollection.__init__.<locals>.map_item(item_or_dict)
     91     return item_or_dict.clone() if clone_items else item_or_dict
     92 else:
---> 93     return pystac.Item.from_dict(item_or_dict, preserve_dict=clone_items)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pystac/item.py:419, in Item.from_dict(cls, d, href, root, migrate, preserve_dict)
    416 d.pop("type")
    417 d.pop("stac_version")
--> 419 item = cls(
    420     id=id,
    421     geometry=geometry,
    422     bbox=bbox,
    423     datetime=datetime,
    424     properties=properties,
    425     stac_extensions=stac_extensions,
    426     collection=collection_id,
    427     extra_fields=d,
    428     assets={k: Asset.from_dict(v) for k, v in assets.items()},
    429 )
    431 has_self_link = False
    432 for link in links:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pystac/item.py:113, in Item.__init__(self, id, geometry, bbox, datetime, properties, stac_extensions, href, collection, extra_fields, assets)
    100 def __init__(
    101     self,
    102     id: str,
   (...)
    111     assets: Optional[Dict[str, Asset]] = None,
    112 ):
--> 113     super().__init__(stac_extensions or [])
    115     self.id = id
    116     self.geometry = geometry

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The proper fix would be in stac_geoparquet.to_dict to handle that casting (https://github.com/TomAugspurger/stac-geoparquet/issues/3). For now, you can work around that by manually mutating the DataFrame to convert the ndarrays to lists.
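
A minimal sketch of that workaround, assuming the problem columns are object-dtype columns whose values (or nested dict values) are NumPy arrays. The helper names `to_builtin` and `ndarrays_to_lists` are made up for illustration and are not part of stac-geoparquet; whether the cleaned DataFrame then round-trips through to_item_collection for every collection is untested here.

```python
import numpy as np
import pandas as pd


def to_builtin(value):
    """Recursively replace NumPy arrays and scalars with plain Python types."""
    if isinstance(value, np.ndarray):
        # tolist() handles numeric arrays; recurse for object arrays holding dicts.
        return [to_builtin(v) for v in value.tolist()]
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, dict):
        return {k: to_builtin(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_builtin(v) for v in value]
    return value


def ndarrays_to_lists(df):
    """Return a copy of df with ndarrays in object columns converted to lists."""
    out = df.copy()
    for col in out.columns:
        # Only object columns can hold arrays or dicts; geometry and
        # datetime columns are left untouched.
        if out[col].dtype == object:
            out[col] = out[col].map(to_builtin)
    return out


# Hypothetical usage: clean the DataFrame first, then try the conversion.
# clean = ndarrays_to_lists(df)
# ic = stac_geoparquet.to_item_collection(clean)
```

The function returns a copy rather than mutating in place, so the original DataFrame stays available if the conversion still fails on some other column.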

> also the href seems to be un-signed, so I cannot use the rest of the pipeline used in the tutorials.

Yes, the URLs in the returned DataFrame are unsigned. Because signing uses short-lived SAS tokens, I'd recommend signing just before you use the URLs. If you have an ItemCollection, you can call sign on it. The planetary-computer Python package doesn't currently support signing DataFrames with this structure (microsoft/planetary-computer-sdk-for-python#46).
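
Until that is supported, one possible stopgap is to sign each href yourself right before reading. The sketch below assumes each row of the assets column is a dict mapping asset keys to dicts with an "href" entry (or None when an item lacks that asset). `sign_assets` is a hypothetical helper, not part of the planetary-computer package; the signing function is passed in, and planetary_computer.sign_url would be the natural candidate.

```python
import copy
from typing import Callable, Optional


def sign_assets(assets: Optional[dict], sign_url: Callable[[str], str]) -> Optional[dict]:
    """Return a copy of one row's assets with every href passed through sign_url."""
    if assets is None:
        return None
    signed = copy.deepcopy(assets)  # leave the original row untouched
    for asset in signed.values():
        # Asset keys missing for this item may be stored as None.
        if asset and asset.get("href"):
            asset["href"] = sign_url(asset["href"])
    return signed


# Hypothetical usage (needs network access to fetch a SAS token):
# import planetary_computer
# df["assets"] = df["assets"].map(
#     lambda a: sign_assets(a, planetary_computer.sign_url)
# )
```

Because the tokens expire, it is probably better to sign only the rows you are about to read rather than the whole DataFrame up front.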

> Is there a way to request the images using the subset of the metadata or should I use the metadata to do individual queries again?

Can you expand on this? Do you want the actual bytes in the image files? Or do you just want the URLs to the images? You can use the columns=["assets"] keyword to read_parquet to read just the assets column, and extract the URLs from there.
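
If the URLs are all you need, something along these lines might work. It assumes each value in the assets column is a dict keyed by asset name with an "href" entry; `asset_hrefs`, the asset key "B01", and the file path in the comment are placeholders, not names from the tutorial.

```python
import pandas as pd


def asset_hrefs(assets_column: pd.Series, asset_key: str) -> pd.Series:
    """Extract the href of one asset key from each row's assets dict."""

    def pick(assets):
        # Rows with no assets, or without this key, yield a missing value.
        if not isinstance(assets, dict):
            return None
        asset = assets.get(asset_key)
        return asset.get("href") if asset else None

    return assets_column.map(pick)


# Hypothetical usage: read only the assets column, then pull out the URLs.
# df = pd.read_parquet("items.parquet", columns=["assets"])
# urls = asset_hrefs(df["assets"], "B01")
```

The resulting URLs are still unsigned, so they would need signing before any data can be read.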
