to_geodataframe error: All arrays must be of the same length #76
Comments
That's correct in general. That said, stac-geoparquet is mainly focused on the use case where items in a collection are homogeneous. Just to confirm, you're using:

```
In [41]: rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)

In [42]: df = rbr.read_pandas(types_mapper=pd.ArrowDtype)

In [43]: df
Out[43]:
                                              assets                                               bbox      collection  ...  s2:water_percentage  sat:orbit_state  sat:relative_orbit
0  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'...  sentinel-2-l2a  ...             0.171997        ascending                   8
1  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'...  sentinel-2-l2a  ...             0.556677       descending                 108
2  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 5.6670094, 'ymin': 47.6908511, 'xmax'...  sentinel-2-l2a  ...             0.190585       descending                 108

[3 rows x 42 columns]
```
That's right, I just updated the example to be clear.

That is working well with missing properties (replaced by `<NA>`):

```
assets                 struct<AOT: struct<gsd: double, href: string, ...
bbox                   struct<xmin: double, ymin: double, xmax: doubl...
collection             string[pyarrow]
geometry               binary[pyarrow]
id                     string[pyarrow]
links                  list<item: struct<href: string, rel: string, t...
                                      ...
s2:water_percentage    double[pyarrow]
sat:orbit_state        string[pyarrow]
sat:relative_orbit     int64[pyarrow]
tilename               string[pyarrow]
dtype: object
```

However, in order to get a GeoDataFrame:

```python
import geopandas as gpd

gs = gpd.GeoSeries.from_wkb(df.geometry)
gdf = gpd.GeoDataFrame(df, geometry=gs)  # keeps pyarrow types
gdf.dtypes
```

```
Out[80]:
assets        struct<AOT: struct<gsd: double, href: string, ...
bbox          struct<xmin: double, ymin: double, xmax: doubl...
collection    string[pyarrow]
geometry      geometry
id            string[pyarrow]
...
```

```python
# an alternative converting `timestamp` type to `datetime64` and `<NA>` to `NaN`,
# but other specific pyarrow types (string, struct, list) converted to `object` type:
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
gdf = gpd.GeoDataFrame.from_arrow(rbr)
gdf.dtypes
```

```
Out[81]:
assets                    object
bbox                      object
collection                object
geometry                geometry
id                        object
...
datetime     datetime64[us, UTC]
...
```

Any reason for not having ...? That would certainly make it more robust and easier to maintain, wouldn't it?
Yeah, the implementation in ...
For the record, my perf tests show that the old way is about 10x faster than the new:

```python
import pystac_client
import stac_geoparquet
from timeit import timeit
import pandas as pd
import geopandas as gpd

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[6.5425, 47.9044, 6.5548, 47.9091],
    datetime="2018-01-01/2024-07-20",  # starting from 2024-08-11, 's2:dark_features_percentage' was removed
    sortby="datetime",
)
coll = search.item_collection()
print(set(coll[0].properties.keys()).symmetric_difference(coll[-1].properties.keys()))
# {'s2:dark_features_percentage'}  # this property is missing in coll[1:3], due to a different processing baseline (05.11 instead of 05.10)

nb_runs = 10

#### stac-geoparquet 0.3.2 fills missing properties with NaN
#### stac-geoparquet > 0.3.2 raises an error if properties are missing in some items
def fun1(coll):
    records = coll.to_dict()["features"]
    gdf = stac_geoparquet.to_geodataframe(records)
    return gdf

#### stac-geoparquet 0.6.0
def fun2(coll):
    rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
    df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
    gs = gpd.GeoSeries.from_wkb(df.geometry)
    gdf = gpd.GeoDataFrame(df, geometry=gs)  # keeps pyarrow types
    return gdf

duration = timeit("fun1(coll)", globals=globals(), number=nb_runs)
avg_duration = duration / nb_runs
print(f"fun1: {avg_duration:.4f} seconds")

# note: lexicographic string comparison; fine here for single-digit 0.x.y versions
if stac_geoparquet.__version__ >= "0.6.0":
    duration = timeit("fun2(coll)", globals=globals(), number=nb_runs)
    avg_duration = duration / nb_runs
    print(f"fun2: {avg_duration:.4f} seconds")

print(f"Item-Collection size: {len(coll)}")
# fun1: 0.1396 seconds
# fun2: 1.1298 seconds
# Item-Collection size: 1868
```
It would be interesting to time just:

```python
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
table = rbr.read_all()
```

I.e. which part of that is slow? The creation of Arrow data, the conversion to Pandas, or the parsing of geometries? It's also important to note that the newer version is doing a lot more than the original version. For one, it scans the entire input data first to infer a strict schema that represents the full input dataset. It also does a few more transformations to match the latest stac-geoparquet spec.
@kylebarron, adding the following to the `if` statement of the previous post shows that most of the processing time goes into creating the Arrow data (`fun3`):

```python
duration = timeit("fun3(coll)", globals=globals(), number=nb_runs)
avg_duration = duration / nb_runs
print(f"fun3: {avg_duration:.4f} seconds")
rbr = fun3(coll)

duration = timeit("fun4(rbr)", globals=globals(), number=nb_runs)
avg_duration = duration / nb_runs
print(f"fun4: {avg_duration:.4f} seconds")
df = fun4(rbr)

duration = timeit("fun5(df)", globals=globals(), number=nb_runs)
avg_duration = duration / nb_runs
print(f"fun5: {avg_duration:.4f} seconds")
gs = fun5(df)

duration = timeit("fun6(df, gs)", globals=globals(), number=nb_runs)
avg_duration = duration / nb_runs
print(f"fun6: {avg_duration:.4f} seconds")

# fun1: 0.1043 seconds
# fun2: 0.6489 seconds
# fun3: 0.6396 seconds
# fun4: 0.0013 seconds
# fun5: 0.0001 seconds
# fun6: 0.0002 seconds
# Item-Collection size: 1868
```
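The `fun3`–`fun6` helpers are not defined in the snippet above; presumably they split `fun2` into its individual steps, along the lines of this sketch (assuming the imports from the earlier benchmark; these are not the original definitions):

```python
# Presumed split of fun2 into its individual steps.
def fun3(coll):
    # STAC items -> Arrow RecordBatchReader
    return stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)

def fun4(rbr):
    # Arrow -> pandas, keeping pyarrow-backed dtypes
    return rbr.read_pandas(types_mapper=pd.ArrowDtype)

def fun5(df):
    # decode the WKB geometry column
    return gpd.GeoSeries.from_wkb(df.geometry)

def fun6(df, gs):
    # assemble the final GeoDataFrame
    return gpd.GeoDataFrame(df, geometry=gs)
```

One caveat with this setup: a `RecordBatchReader` is a stream that can only be consumed once, so repeated `timeit` runs of `fun4(rbr)` only read data on the first iteration, which may explain why `fun4` appears nearly free.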
It may occur that the items of an ItemCollection do not share all the property keys; an ItemCollection cannot guarantee that all properties are shared by all items (or am I wrong about that?). For an example, see the sentinel-2-l2a search in the comments above, where 's2:dark_features_percentage' is missing from some items.

In stac-geoparquet <= 0.3.2, the geodataframe was built from a list of dicts, which introduced NaN where a property was missing. Since commit #fb798f4 (included in versions 0.4.0+), the geodataframe is built from a dict of lists (for acceleration, I suppose), so a missing property in an item makes the operation fail at L177 with:

```
All arrays must be of the same length
```
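To illustrate the difference with plain pandas (a minimal sketch, not taken from the issue itself):

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 3}]  # the second record lacks "b"

# list of dicts: pandas aligns keys across records and fills gaps with NaN
pd.DataFrame(records)  # column "b" is NaN in the second row

# dict of lists: nothing aligns the lengths, so construction fails with
# "ValueError: All arrays must be of the same length"
columns = {}
for record in records:
    for key, value in record.items():
        columns.setdefault(key, []).append(value)
pd.DataFrame(columns)  # raises ValueError
```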