This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

fix methods to infer geom and bbox #3

Open · wants to merge 5 commits into main

Conversation

FlorisCalkoen

Hi @TomAugspurger, I recently used stac-table to create a STAC collection for a Parquet dataset. Along the way I made some minor changes to the package; please take a look and see whether you'd like to keep them.

In the source code I saw some TODO comments about 'maybe converting the geometries to EPSG:4326'. Since my data was in EPSG:3857, I added reprojection for the geometry and bbox that sit directly on the pystac.Item, so those are now in EPSG:4326, whereas the fields under the proj: prefix stay in the CRS of the source data.

item.bbox # in 4326
item.geometry # in 4326
item.properties["proj:<geom>"] # in projection of source data
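Since my source data is in EPSG:3857, the reprojection for the top-level fields boils down to Web Mercator to lon/lat. In the real code this goes through pyproj/GeoPandas (e.g. `to_crs(4326)`); just as a sanity check, a minimal stdlib sketch of the conversion (function name is mine):

```python
import math

R = 6378137.0  # Web Mercator sphere radius in metres

def mercator_to_lonlat(x, y):
    """Convert EPSG:3857 (x, y) metres to EPSG:4326 (lon, lat) degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2.0 * math.atan(math.exp(y / R)) - math.pi / 2.0)
    return lon, lat

# Reprojecting a 3857 bbox to 4326 by converting its corners:
west, south = mercator_to_lonlat(-20037508.342789244, -20037508.342789244)
east, north = mercator_to_lonlat(20037508.342789244, 20037508.342789244)
# west/east come out at +/-180, south/north at the Mercator cutoff ~ +/-85.051
```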

For some conditions, the data is loaded using dask_geopandas.read_parquet(); but at least for my dataset the spatial partitions were not available without computing them first. What do you think about adding a call to calculate_spatial_partitions()?

        # some condition
        data = dask_geopandas.read_parquet(uri, storage_options=storage_options)
        data.calculate_spatial_partitions()

@TomAugspurger
Collaborator

Thanks for the PR!

What do you think about adding a call to calculate_spatial_partitions()?

That sounds reasonable. We'll just want to make sure that .calculate_spatial_partitions is smart about not recomputing the values unless necessary.

I'm not immediately sure why CI failed.

>       assert result.properties["table:columns"] == expected_columns
E       AssertionError: assert [{'name': 'po...'byte_array'}] == [{'name': 'po...'byte_array'}]
E         At index 0 diff: {'name': 'pop_est', 'type': 'double'} != {'name': 'pop_est', 'type': 'int64'}
E         Full diff:
E           [
E         -  {'name': 'pop_est', 'type': 'int64'},
E         ?                               ^^^^^
E         +  {'name': 'pop_est', 'type': 'double'},
E         ?                               ^^^^^^
E            {'name': 'continent', 'type': 'byte_array'},
E            {'name': 'name', 'type': 'byte_array'},
E            {'name': 'iso_a3', 'type': 'byte_array'},
E         -  {'name': 'gdp_md_est', 'type': 'double'},
E         ?                                  ^^^^^^
E         +  {'name': 'gdp_md_est', 'type': 'int64'},
E         ?                                  ^^^^^
E            {'name': 'geometry', 'type': 'byte_array'},
E           ]

Probably unrelated to your change. Maybe pandas used to cast these to float and now it doesn't?
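One plausible mechanism for the float/int flip (my speculation, not verified against the pandas changelog): a missing value in an integer column forces pandas to upcast it to float64, so whether a column like pop_est contains a NaN at read time decides between int64 and double:

```python
import pandas as pd

clean = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])  # a single missing value upcasts to float

print(clean.dtype)     # int64
print(with_gap.dtype)  # float64
```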

@FlorisCalkoen
Author

Hi @TomAugspurger, many thanks for the swift response! So I made the following changes to the PR:

Datatypes

I've changed the expected datatypes in the tests to double for "pop_est" and int64 for "gdp_md_est". Although this matches what you find in pandas, it's a bit odd, because I would expect population to be an integer and GDP a float.

These are the dtypes in the test data:

print(type(ds))  # <class 'pyarrow.parquet.core._ParquetDatasetV2'>
fragment = ds.fragments[0]

print(ds.schema)

pop_est: double
continent: string
name: string
iso_a3: string
gdp_md_est: int64
geometry: binary
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 945
geo: '{"primary_column": "geometry", "columns": {"geometry": {"encoding":' + 1355


print(fragment.metadata.schema)
<pyarrow._parquet.ParquetSchema object at 0x16cefd140>
required group field_id=-1 schema {
  optional double field_id=-1 pop_est;
  optional binary field_id=-1 continent (String);
  optional binary field_id=-1 name (String);
  optional binary field_id=-1 iso_a3 (String);
  optional int64 field_id=-1 gdp_md_est;
  optional binary field_id=-1 geometry;
  optional int64 field_id=-1 __null_dask_index__;
}



print(gpd.read_parquet("path/to/written/data.parquet").dtypes)  # same as gpd.read_file(gpd.datasets.get_path("naturalearth_lowres")).dtypes

pop_est        float64
continent       object
name            object
iso_a3          object
gdp_md_est       int64
geometry      geometry
dtype: object
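For reference, the table:columns entries mix two vocabularies: double/int64 come straight from the Arrow schema, while string/binary columns show up as the Parquet physical type byte_array. A small sketch of that mapping (the dict and variable names are my own illustration, not stac-table's internals):

```python
# Assumed mapping from the Arrow types in print(ds.schema) to the
# Parquet physical types seen in table:columns.
ARROW_TO_PARQUET_PHYSICAL = {
    "double": "double",
    "int64": "int64",
    "string": "byte_array",
    "binary": "byte_array",
}

# (name, arrow type) pairs mirroring the schema printed above
arrow_schema = [
    ("pop_est", "double"),
    ("continent", "string"),
    ("name", "string"),
    ("iso_a3", "string"),
    ("gdp_md_est", "int64"),
    ("geometry", "binary"),
]

table_columns = [
    {"name": name, "type": ARROW_TO_PARQUET_PHYSICAL[typ]}
    for name, typ in arrow_schema
]
print(table_columns[0])  # {'name': 'pop_est', 'type': 'double'}
```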

Spatial partitions

I've added a condition that checks whether spatial_partitions is None:

        ...
        if data.spatial_partitions is None:
            data.calculate_spatial_partitions()
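This should also address the point about not recomputing unless necessary: the guard makes the call idempotent. A stand-alone sketch with a stub (the stub class is mine and only mimics the two members the guard touches, not the real dask_geopandas frame):

```python
class FrameStub:
    """Stand-in for a dask_geopandas frame: spatial_partitions starts as
    None and is populated by calculate_spatial_partitions()."""

    def __init__(self):
        self.spatial_partitions = None
        self.compute_calls = 0

    def calculate_spatial_partitions(self):
        self.compute_calls += 1  # stands in for an expensive compute()
        self.spatial_partitions = ["<partition bounds>"]

data = FrameStub()
for _ in range(3):  # the guard may run many times but computes only once
    if data.spatial_partitions is None:
        data.calculate_spatial_partitions()

print(data.compute_calls)  # 1
```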

Projection EPSG test

One of the tests checked that proj:epsg == 4326. Since proj:epsg can now be any projection, I changed the test to verify that the EPSG code is valid:

import pyproj

def is_valid_epsg(epsg_code):
    try:
        pyproj.CRS.from_user_input(epsg_code)
        return True
    except pyproj.exceptions.CRSError:
        return False

assert is_valid_epsg(result.properties["proj:epsg"])
