Items with heterogeneous Asset keys are parsed incorrectly #82

Open
scottyhq opened this issue Nov 4, 2024 · 2 comments

@scottyhq
Contributor

scottyhq commented Nov 4, 2024

import pystac_client  # 0.8.5
import stac_geoparquet  # 0.6.0
import geopandas as gpd

client = pystac_client.Client.open(url='https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS')

results = client.search(collections=['ATL03_006'],
                        bbox='-108.34, 38.823, -107.728, 39.19',
                        datetime='2023',
                        method='GET',
                        max_items=5,
)
items = results.item_collection()
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
gf = gpd.GeoDataFrame.from_arrow(record_batch_reader)  
gf.assets.iloc[0]

The 'data' asset keys are different for these 5 items, and every item gets a copy of the other keys with None as a value:

{'03/ATL03_20230103090928_02111806_006_02': {'href': 'https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.006/2023.01.03/ATL03_20230103090928_02111806_006_02.h5',
  'roles': array(['data'], dtype=object),
  'title': 'Direct Download'},
 '05/ATL03_20230205073720_07141806_006_02': None,
 '06/ATL03_20230206192127_07371802_006_02': None,
 '06/ATL03_20230306061322_11561806_006_02': None,
 '08/ATL03_20230108204519_02951802_006_02': None,
 'browse': {'href': 'https://n5eil01u.ecs.nsidc.org/DP0/BRWS/Browse.001/2024.04.08/ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'roles': array(['browse'], dtype=object),
  'title': 'Download ATL03_20230103090928_02111806_006_02_BRW.h5.images.tide_pole.jpg',
  'type': 'image/jpeg'},

These None entries prevent going back from a dataframe to pystac items:

import pystac
batch = stac_geoparquet.arrow.stac_table_to_items(gf.to_arrow())
items = [pystac.Item.from_dict(x) for x in batch]

File ~/GitHub/uw-cryo/coincident/.pixi/envs/dev/lib/python3.12/site-packages/pystac/asset.py:199, in Asset.from_dict(cls, d)
    193 """Constructs an Asset from a dict.
    194 
    195 Returns:
    196     Asset: The Asset deserialized from the JSON dict.
    197 """
    198 d = copy(d)
--> 199 href = d.pop("href")
    200 media_type = d.pop("type", None)
    201 title = d.pop("title", None)

AttributeError: 'NoneType' object has no attribute 'pop'
@kylebarron
Collaborator

In general this is a limitation of Parquet.

JSON has three states: a valid value, null, and a missing/undefined key. Because Parquet is columnar, the third option does not exist here. If one key exists for any item, the entire column for that key name is provisioned.
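To make the columnar point concrete, here is a minimal pure-Python sketch of what happens when per-item asset dicts are pivoted into columns and back. It does not use Arrow or stac-geoparquet; `to_columns` and `to_rows` are hypothetical helpers that only mimic the behaviour described above, and the asset keys are made up.

```python
def to_columns(rows):
    """Pivot a list of dicts into one column per key seen in any row."""
    keys = sorted({key for row in rows for key in row})
    # A columnar layout needs a slot for every row in every column, so a
    # key that is absent from a row can only be stored as null (None).
    return {key: [row.get(key) for row in rows] for key in keys}


def to_rows(columns):
    """Pivot the columns back into one dict per row."""
    n_rows = len(next(iter(columns.values()))) if columns else 0
    return [{key: values[i] for key, values in columns.items()} for i in range(n_rows)]


# Two items whose data assets use different keys (illustrative names only).
rows = [
    {"granule_a": {"href": "a.h5"}, "browse": {"href": "a.jpg"}},
    {"granule_b": {"href": "b.h5"}, "browse": {"href": "b.jpg"}},
]

round_tripped = to_rows(to_columns(rows))
print(round_tripped[0])
# {'browse': {'href': 'a.jpg'}, 'granule_a': {'href': 'a.h5'}, 'granule_b': None}
```

The first item never had a `granule_b` key, but after the round trip it does, with a value of None. The "missing key" state has been folded into "null".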

The default Arrow serialization emits None for null Arrow values. There was some discussion about this in a previous issue. Perhaps we could add a keyword parameter to stac_table_to_items to remove keys with None. But this would be difficult to do reliably, especially since None is required for some other keys: a null datetime, for example, means the item has a start/end datetime instead.
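The reliability concern can be illustrated with a small sketch. The `strip_none` helper below is hypothetical (it is not part of stac-geoparquet) and the item fragment is hand-written; it only shows what a blanket "remove every None" pass would do to an item that relies on a null datetime.

```python
def strip_none(value):
    """Recursively drop every dict key whose value is None."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


# Hand-written item fragment; the asset keys are made up for illustration.
item = {
    "properties": {
        "datetime": None,  # meaningful: the item uses a start/end range instead
        "start_datetime": "2023-01-03T09:09:28Z",
        "end_datetime": "2023-01-03T09:17:58Z",
    },
    "assets": {
        "granule_a": {"href": "a.h5"},
        "granule_b": None,  # padding introduced by the columnar layout
    },
}

cleaned = strip_none(item)
print("granule_b" in cleaned["assets"])     # False: padding removed, as intended
print("datetime" in cleaned["properties"])  # False: a meaningful null was removed too
```

Both None values look identical once they are in Python, so a generic pass cannot tell the padding from the intentional null.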

@scottyhq
Contributor Author

scottyhq commented Nov 4, 2024

Right, there is a similar issue (#77) for when an outlier Item in a collection is missing an asset common to the group, like 'thumbnail'.

> add a keyword parameter to stac_table_to_items to remove keys with None

Seems convenient. I wonder if it would be simpler to apply it only to the 'assets' column, to avoid complications with datetime?
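As a rough sketch of what an assets-only cleanup could look like on the item dicts, assuming it runs on each dict produced by stac_table_to_items before pystac sees it. `drop_null_assets` is a hypothetical helper, not an existing stac-geoparquet option, and the sample dict is hand-written:

```python
def drop_null_assets(item_dict):
    """Return a shallow copy of item_dict without None-valued assets.

    Only the top-level "assets" mapping is touched, so nulls elsewhere
    (for example properties.datetime) are left exactly as they were.
    """
    cleaned = dict(item_dict)
    assets = item_dict.get("assets") or {}
    cleaned["assets"] = {key: value for key, value in assets.items() if value is not None}
    return cleaned


# Hand-written item fragment; the asset keys are made up for illustration.
sample = {
    "properties": {"datetime": None, "start_datetime": "2023-01-03T09:09:28Z"},
    "assets": {"granule_a": {"href": "a.h5"}, "granule_b": None},
}

fixed = drop_null_assets(sample)
print(sorted(fixed["assets"]))

# Possible use with the calls from the report above (not run here):
# batch = stac_geoparquet.arrow.stac_table_to_items(gf.to_arrow())
# items = [pystac.Item.from_dict(drop_null_assets(x)) for x in batch]
```

Restricting the filter to one known mapping sidesteps the datetime question, at the cost of not handling null padding in any other nested column.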

My quick workaround currently is to just filter the assets column after loading to geopandas:

def filter_assets(assets):
    """Remove key: None pairs from an assets dict."""
    return {key: value for key, value in assets.items() if value is not None}

gf['assets'] = gf['assets'].apply(filter_assets)
