clarification on difference between this library and stackstac? #54
Hello @rbavery! The intent of both is the same: to create xarrays from STAC Items. The way the xarrays are constructed is slightly different. odc-stac has more capability overall, such as being able to handle any type of CRS rather than just an EPSG code, but stackstac is a bit simpler to use. odc-stac is under active development, whereas the pace of stackstac is slower. One thing to note is that, as part of the odc-stac effort, odc-geo was released (https://github.com/opendatacube/odc-geo/), which can serve as a foundational library that a library like stackstac may be able to use. @Kirill888 did some performance benchmarking against stackstac that you can see here: Also interested in @gjoseph92's thoughts on this.
It's a great question. stackstac was a proof-of-concept side project I put together in a month—I had wanted to see if STAC, COG, rasterio, and dask could work together nicely to have STAC->xarray "just work". I'd say it proved the concept. It's been great to see that it's useful to many people. But stackstac isn't my full-time job, so I don't get to work on it that much anymore.
I think this is the biggest difference. Pretty much all of my effort in stackstac went to low-level optimization of both parallel data loading with GDAL and dask array construction, specifically for running on clusters. stackstac does a few things:
One other thing is odc-stac is probably more robust/reasonable at handling the STAC metadata and picking a default GeoBox. Also, I wish odc-geo had existed when I started stackstac; I would have loved to just use that. This sort of thing has been a weak point in the ecosystem, and hopefully now that there's a package for it, we can all consolidate on that! Frankly, I don't think we really need to have two packages that do such similar things, and I'd love to see them combined someday. ODC has a lot more legacy, and institutional support, so maybe that'll become the standard. Mostly I care about the low-level optimizations; it would be nice to see those picked up in ODC. I do like the name stackstac though :)
These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the |
Some History
Open Datacube predates STAC and in fact had some influence on the evolution of the STAC spec (the proj extension in particular), but because of that it has its own metadata format and its own metadata management systems. The metadata management part of the Open Datacube system is the biggest obstacle to adoption. The large organizations listed above have the resources to set up and maintain systems that let their data scientists focus on data experiments rather than metadata wrangling, but a PhD student trying out a new ML model doesn't have that luxury.
Comparison to stackstac
Both take STAC items and some data loading configuration and produce Dask backed xarrays of raster pixels, but there are some user facing differences that I'll try my best to enumerate
Fundamentally In ^Benchmark report Matt linked earlier reflects that difference in focus nicely. ^ I need to update it with results from more recent |
@gjoseph92 thanks for the response; some comments below:
we use rasterio
It's a very handy optimization that I'm planning to use in the next version of
I have not benchmarked Dask graph construction, but yes, I agree that odc does significantly more work up front: for every spatial chunk we figure out which STAC items contribute to that chunk. This is done in a rather efficient way, but it still takes more time. The advantage is that you end up with a much more compact Dask graph in the "wide" area case. And since graph size has an impact on graph submission time, the overall time from STAC items IN to first pixel OUT can be lower. This also means that empty slots are extremely cheap on the wire and for compute (no need to call to GDAL vrt to get
That is indeed the reason for better performance in wide area mosaic case. With |
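To make the up-front work described above more concrete, here is a minimal sketch of mapping spatial chunks to the items that overlap them. It is not odc-stac's actual implementation: the function names (`overlaps`, `chunk_bboxes`, `items_per_chunk`), the plain bounding-box tuples, and the brute-force overlap test are all assumptions made for illustration, and a real loader would work on projected footprints with a smarter spatial index.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical representation: (xmin, ymin, xmax, ymax) in one shared CRS.
BBox = Tuple[float, float, float, float]


def overlaps(a: BBox, b: BBox) -> bool:
    """Return True if the two boxes share some area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def chunk_bboxes(extent: BBox, chunk_size: float) -> Dict[Tuple[int, int], BBox]:
    """Split the output extent into a regular grid of chunks keyed by (row, col)."""
    xmin, ymin, xmax, ymax = extent
    nx = math.ceil((xmax - xmin) / chunk_size)
    ny = math.ceil((ymax - ymin) / chunk_size)
    chunks = {}
    for iy in range(ny):
        for ix in range(nx):
            x0 = xmin + ix * chunk_size
            y0 = ymin + iy * chunk_size
            # Clip the last row/column of chunks to the extent.
            chunks[(iy, ix)] = (x0, y0, min(x0 + chunk_size, xmax), min(y0 + chunk_size, ymax))
    return chunks


def items_per_chunk(
    items: Dict[str, BBox], extent: BBox, chunk_size: float
) -> Dict[Tuple[int, int], List[str]]:
    """For every chunk, list the ids of the items whose footprint overlaps it."""
    return {
        idx: [item_id for item_id, bbox in items.items() if overlaps(bbox, chunk_bbox)]
        for idx, chunk_bbox in chunk_bboxes(extent, chunk_size).items()
    }


# Three made-up item footprints over an extent split into three chunks.
items = {"a": (0, 0, 1, 1), "b": (3, 0, 4, 2), "c": (1.5, 0.5, 2.5, 1.5)}
mapping = items_per_chunk(items, extent=(0, 0, 6, 2), chunk_size=2)
```

In this sketch a chunk with an empty list never needs a read task at all, which is one way to picture why empty slots can be cheap and why the resulting graph stays compact for wide areas.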
Thanks so much for all this context and comparison @matthewhanson @Kirill888 @gjoseph92 . I'm super impressed with both libraries and am excited that so much progress has been made on the problem of munging both wide and tall datasets with xarray. |
Sorry for the bump, I'd like to revisit this specifically and ask for clarification (might be off-topic, but I wasn't sure where to bring this up). Could you elaborate please on what does Thanks! EDIT: Upon further inspection I found https://github.com/opendatacube/odc-stac/blob/52a016be2115f180e7059f67bcc6106fbeba7e8d/odc/stac/_load.py#LL701C9-L701C57, which suggests that the first pixel in each day is chosen. Is that indeed the case? Is there additional filtering behind the scenes? |
true. Original use case was stitching "true tiles"... There are really two independent dimensions to
Right now merging method is always "first observed valid pixel", with precedence for defining "first" being either
My understanding of By the way PRs for documentation enhancements are welcome :). |
Thanks @Kirill888, that helps a lot! I will hopefully make some PR for documentation once I'm done with our implementation of If I understand correctly, to mimic |
@idantene it's a bit more complicated than that:
Pixel is valid if it's a finite number (not |
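As a rough illustration of the "first observed valid pixel" merging discussed above, the sketch below fuses several layers pixel by pixel. It is not the odc-stac code: `is_valid` and `fuse_first_valid` are hypothetical names, pixels are plain Python floats in flat lists, and treating a value equal to `nodata` as invalid is an assumption on top of the "finite number" rule stated in the thread.

```python
import math
from typing import List, Optional, Sequence


def is_valid(value: float, nodata: Optional[float] = None) -> bool:
    """A pixel counts as valid if it is finite and (assumption) not equal to nodata."""
    if not math.isfinite(value):
        return False
    return nodata is None or value != nodata


def fuse_first_valid(
    layers: Sequence[Sequence[float]], nodata: Optional[float] = None
) -> List[float]:
    """Merge layers pixel by pixel, keeping the first valid value in layer order.

    The order of `layers` defines what "first" means. Pixels with no valid
    observation are filled with `nodata`, or NaN when no nodata value is given.
    """
    if not layers:
        return []
    fill = math.nan if nodata is None else nodata
    fused = [fill] * len(layers[0])
    for i in range(len(fused)):
        for layer in layers:
            if is_valid(layer[i], nodata):
                fused[i] = layer[i]
                break  # later layers are ignored once a valid pixel is found
    return fused


# Three overlapping observations of the same three pixels.
layers = [
    [math.nan, 2.0, -9999.0],
    [1.0, 5.0, -9999.0],
    [7.0, 8.0, 3.0],
]
fused = fuse_first_valid(layers, nodata=-9999.0)
```

Reordering `layers` changes the result, which is why the precedence used to define "first" matters for the output.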
Thanks for this nice recap of the differences between the projects. On my side not having all the metadata parsed by |
I think this recent discussion from the Pangeo Discourse forum should be linked here as well: https://discourse.pangeo.io/t/comparing-odc-stac-load-and-stackstac-for-raster-composite-workflow/4097/13 |
Hi! I just learned about this library, and I'm curious what the feature overlap with stackstac is and what the differences are, if someone is familiar with both. Which should I use in which scenario? https://github.com/gjoseph92/stackstac
Asking because we are currently considering how to show newcomers to geospatial Python how to load imagery into dask-backed xarrays: carpentries-incubator/geospatial-python#102. Any tips appreciated.