Replies: 3 comments 1 reply
-
Also interested in this topic! @TomAugspurger, could you perhaps comment on how this is done by Microsoft Planetary Computer for datasets like Sentinel-1 or Sentinel-2? According to this thread, microsoft/PlanetaryComputer#368, the catalog is updated every 6 hours, but it looks like the GeoParquet dump hasn't been updated since 2024-06-24?

import pystac_client
import planetary_computer
import adlfs
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    # NOTE: autosigns URLs with SAS tokens, and adds 'credential' to 'table:storage_options'
    modifier=planetary_computer.sign_inplace,
)
collection_id = "sentinel-2-l2a"
# STAC API
collection = catalog.get_collection(collection_id)
most_recent_item = next(collection.get_items())
print(most_recent_item.datetime)
# 2024-12-23 03:23:09.024000+00:00
# STAC GEOPARQUET
gpq = collection.assets["geoparquet-items"]
fs = adlfs.AzureBlobFileSystem(**gpq.extra_fields["table:storage_options"])
print(fs.ls(gpq.href)[-1:])
# ['items/sentinel-2-l2a.parquet/part-0469_2024-06-17T10:25:31+00:00_2024-06-24T10:25:31+00:00.parquet']
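To put a number on that gap, here is a rough sketch that parses the time window out of the partition file name and compares it with the most recent item from the STAC API. It assumes the file names follow the `part-<n>_<start>_<end>.parquet` pattern seen in the listing above (I have not confirmed this is a documented convention), and `partition_window` is a hypothetical helper, not part of any library.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_window(path: str) -> tuple[datetime, datetime]:
    """Parse the start/end timestamps embedded in a partition file name.

    Assumed layout (inferred from the listing, not from documentation):
    part-<n>_<start ISO 8601>_<end ISO 8601>.parquet
    """
    stem = PurePosixPath(path).name.removesuffix(".parquet")
    _, start, end = stem.split("_")
    return datetime.fromisoformat(start), datetime.fromisoformat(end)


# Values copied from the output in the snippet above.
latest_partition = (
    "items/sentinel-2-l2a.parquet/"
    "part-0469_2024-06-17T10:25:31+00:00_2024-06-24T10:25:31+00:00.parquet"
)
latest_item = datetime.fromisoformat("2024-12-23 03:23:09.024000+00:00")

_, window_end = partition_window(latest_partition)
lag = latest_item - window_end
print(f"GeoParquet dump trails the STAC API by {lag.days} days")
```

A check like this could run on a schedule to flag when the dump falls further behind than expected.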
-
You'd probably want to test out something like Iceberg or Delta Lake to manage ACID transactions on top of Parquet. Otherwise the simplest option would be to regenerate the Parquet files every X hours/days.
-
Thanks for the feedback, everyone. In the case of a real-time data pipeline, I’ve been thinking that maybe the source of truth should be the GeoParquet and that the STAC JSON catalog should be the one to lag behind. This would enable processing algorithms, responsible for generating low-latency derived data products, to efficiently access input data in the form of GeoParquet via Delta Lake, Apache Iceberg, etc. Then, at some later point, the STAC JSON catalog would be updated from the GeoParquet, which holds the most up-to-date data. Have others used stac-geoparquet in this way? Am I thinking about this all wrong?
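To make the idea concrete, here is a minimal sketch of the "catalog lags behind" step, using plain dicts as stand-ins for both sides. Everything here is an assumption for illustration: `rows` represents records read from the GeoParquet table, `catalog` represents the STAC JSON catalog keyed by item id, and the `updated` field and the `pending_items`/`sync_catalog` helpers are hypothetical, not part of stac-geoparquet, Delta Lake, or Iceberg.

```python
from datetime import datetime


def pending_items(rows, watermark):
    """Return rows updated after the watermark, oldest first."""
    newer = [r for r in rows if datetime.fromisoformat(r["updated"]) > watermark]
    return sorted(newer, key=lambda r: datetime.fromisoformat(r["updated"]))


def sync_catalog(rows, catalog, watermark):
    """Upsert pending rows into the catalog and return the new watermark."""
    for row in pending_items(rows, watermark):
        catalog[row["id"]] = row  # insert or overwrite by item id
        watermark = datetime.fromisoformat(row["updated"])
    return watermark


# Hypothetical data: the table (source of truth) is ahead of the catalog.
rows = [
    {"id": "a", "updated": "2024-12-01T00:00:00+00:00"},
    {"id": "c", "updated": "2024-12-03T00:00:00+00:00"},
    {"id": "b", "updated": "2024-12-02T00:00:00+00:00"},
]
catalog = {"a": rows[0]}
watermark = datetime.fromisoformat("2024-12-01T00:00:00+00:00")

watermark = sync_catalog(rows, catalog, watermark)
print(sorted(catalog), watermark.isoformat())
```

Because the sync is keyed on item id and driven by a watermark, re-running it with no new rows changes nothing, which is what I would want from a catch-up job that runs every so often.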
-
I’m designing geospatial workflows for a non-static STAC catalog where new records/assets are added frequently. My goal is to maintain strong consistency between STAC JSON and GeoParquet formats while leveraging:
Question:
Looking forward to your input!