Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarification on difference between this library and stackstac? #54

Open
rbavery opened this issue Apr 19, 2022 · 11 comments
Open

clarification on difference between this library and stackstac? #54

rbavery opened this issue Apr 19, 2022 · 11 comments

Comments

@rbavery
Copy link

rbavery commented Apr 19, 2022

Hi! I just learned about this library and I'm curious what the feature overlap is with stackstac and what the differences are if someone is familiar with both? which should I use in which scenario? https://github.com/gjoseph92/stackstac

Asking because currently we are considering how to show newcomers to geospatial python how to load imagery into dask-backed xarrays: carpentries-incubator/geospatial-python#102. Any tips appreicated.

@matthewhanson
Copy link

Hello @rbavery ! The intent of them both is the same, which is to create XArrays from STAC Items. The way the XArrays are constructed are slightly different. odc-stac has more capability overall, such as being able to handle any type of CRS over just an EPSG code, but stackstac is a bit simpler to use. odc-stac is under current development, whereas the pace of stackstac is slower.

One thing to note is that as part of the odc-stac effort, odc-geo was released, https://github.com/opendatacube/odc-geo/, which can serve as a foundational library that a library like stackstac may be able to use.

@Kirill888 did some performance benchmarking vs stackstac that you can see here:
https://benchmark-odc-stac-vs-stackstac.netlify.app/
and he may weigh in as well.

Also interested in @gjoseph92's thoughts on this.

@gjoseph92
Copy link

It's a great question. stackstac was a proof-of-concept side project I put together in a month—I had wanted to see if STAC, COG, rasterio, and dask could work together nicely to have STAC->xarray "just work". I'd say it proved the concept. It's been great to see that it's useful to many people. But stackstac isn't my full-time job, so I don't get to work on it that much anymore.

The way the XArrays are constructed are slightly different

I think this is the biggest difference. Pretty much all of my effort in stackstac went to low-level optimization of both parallel data loading with GDAL and dask array construction, specifically for running on clusters. stackstac does a few things:

  • Supports threadsafe parallel reads (I'm not sure how ODC handles concurrency to safely do multithreaded reads)
  • Manages the GDAL cache so multithreaded reading is as fast as possible
  • Constructs the dask graph as efficiently as possible (you can construct graphs of any size in constant time, whereas ODC generates low-level graphs, which will get very slow for large datasets)

One other thing is stackstac.show, which lets you view raster data on a map. But this doesn't actually require stackstac at all; you could use it with odc-stac arrays as well.

odc-stac is probably more robust/reasonable at handling the STAC metadata and picking a default GeoBox. Also, I wish odc-geo had existed when I started stackstac; I would have loved to just use that. This sort of thing has been a weak point in the ecosystem, and hopefully now that there's a package for it, we can all consolidate on that!

Frankly, I don't think we really need to have two packages that do such similar things, and I'd love to see them combined someday. ODC has a lot more legacy, and institutional support, so maybe that'll become the standard. Mostly I care about the low-level optimizations; it would be nice to see those picked up in ODC. I do like the name stackstac though :)

performance benchmarking vs stackstac

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

@Kirill888
Copy link
Member

Kirill888 commented Apr 20, 2022

Some History

While odc-stac itself is relatively new, it is really an offshoot of Open Datacube project that has a long history. In it's current form odc-stac re-uses data loading components of the core datacube library that has been used for data production by Digitial Earth Australia, Digital Earth Africa and CSIRO (Australian national science agency, particularly their EASI hub) and other members of Open Datacube community.

Open Datacube predates STAC and in fact had some influence on the evolution of STAC spec (proj extension in particular), but because of that it has it's own metadata format and it's own metadata management systems. Metadata management part of the open datacube system is the biggest obstacle for adoption. Large organizations listed above have the resources to setup and maintain systems that enable their data scientists to focus on the data experiments and not metadata wrangling, but some phd student trying out some new ml model doesn't have that luxury.

odc-stac does away with all that metadata management hassle by outsourcing all metadata handling to the STAC ecosystem. You don't need to setup any databases or fiddle with metadata indexing, you just get some STAC items from stac api query, and then get back xarray dataset in the same way you would get it from open datacube system. So it was a way to enable convenient access to STAC data for users of open datacube. But odc-stac is useful on it's own, even if you are not part of open datacube ecosystem.

Comparison to stackstac

Both take STAC items and some data loading configuration and produce Dask backed xarrays of raster pixels, but there are some user facing differences that I'll try my best to enumerate

  • Dataset vs Datarray
    • odc-stac returns xarray.Dataset object with each band being a separate array
    • stackstac produces a single xarray.DataArray with band being a dimension within returned array
    • Main advantage of Dataset return type is that bands can have different pixel type, so your mask can stay in uint8 even when other bands are float32
    • Main advantage of DataArray return type is that you can better handle interleaved pixel data sources (RGB)
  • Exposing geo-referencing metadata
    • odc-stac attaches CRS information in a special way that allows it survive various transformations including cropping of the data. This format was co-developed with rioxarray.
    • stackstac uses array attributes to pass along geo-referencing information, array attributes are easy to lose as you perform pixel processing of any kind.
    • odc-stac supports arbitrary WKT
    • stackstac is currently only deals with EPSG codes
    • One can easily convert stackstac loaded array to the same model with rioxarray.set_crs before doing any further processing though, just need to make sure to use center pixel mode when loading data with stackstac.
  • Access to the original STAC metadata
    • odc-stac doesn't really expose any of that, and there is a fundamental design choice that makes it impossible to do in a general case, but we can certainly add it for special case data loading in the future.
    • stackstac exposes all the metadata fields in the returned xarray, combined with delayed computation enabled by Dask this can be very handy as you can leverage all the xarray conveniences to filter out unwanted data.
  • Dealing with overlapping data, grouping of STAC Items into pixel planes
    • odc-stac has native support for grouping multiple STAC items into a single raster plane. You can group by exact timestamp or by "solar day" (date component of the timestamp adjusted for timezone of the data).
    • stackstac always returns one stac item per pixel plane, you can then perform groupby in any way you need and collapse those rasters into one with mosaic operator provided by stackstac library.

Fundamentally stackstac gives you an xarray view of STAC Items, you can more conveniently explore metadata fields than with just pystac.Item or plain dictionaries. Basically it's [pystac.Item] -> PandasDataframe + Delayed Pixels. There is clear one to one mapping from STAC Item to an index within returned xarray.DataArray.

In odc-stac it's all about raster planes instead, given some STAC Items, group them into timestamped pixel planes and expose those pixels after fusing multiple possibly overlapping items in some deterministic order.

^Benchmark report Matt linked earlier reflects that difference in focus nicely. stackstac excels at spatially narrow by temporally deep workflows, while odc-stac is significantly more performant in wide area workflows. If you need to build high resolution mosaic from hundreds of stac items you are better off with odc-stac, as stackstac will be quite a bit slower and more memory hungry. But if you need convenient access to STAC metadata when you are working with "deep time" data and need a clean 1:1 correspondence between STAC items and pixel planes, stackstac is the best choice.

^ I need to update it with results from more recent stackstac Kirill888/benchmark-report#2 there has been some improvements in stackstac lib since, but fundamentally wide area case remains harder for stackstac than for odc-stac.

@Kirill888
Copy link
Member

@gjoseph92 thanks for respone some comments below:

Supports threadsafe parallel reads (I'm not sure how ODC handles concurrency to safely do multithreaded reads)

we use rasterio sharing=False when opening data, added when we first encountered those issues in datacube-core back then, also have external hard-locking when working with netcdf sources, not that matters in the cloud.

Manages the GDAL cache so multithreaded reading is as fast as possible

It's a very handy optimization that I'm planning to use in the next version of odc-stac that does away with datacube dependency.

Constructs the dask graph as efficiently as possible (you can construct graphs of any size in constant time, whereas ODC generates low-level graphs, which will get very slow for large datasets)

I have not benchmarked dask graph construction, but yes I agree that odc does significantly more of work up-front: for every spatial chunk we figure out which stac items contribute to that chunk, in a rather efficient way, but still it does take more time. The advantage is that you end up with a much more compact Dask graph in the "wide" area case. And since graph size has impact on graph submission time overall time from Stac items IN to first pixel OUT can be lower. This also means that empty slots are extremely cheap on the wire and for compute (no need to call to GDAL vrt to get nodata pixels out)

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

That is indeed the reason for better performance in wide area mosaic case. With odc-stac it is possible to disable grouping at construction time and instead perform that fusing in Dask as well. I have not benchmarked that case though. In theory one could detect "empty" slots and optimize Dask graph to avoid unnecessary fusing that way as well. I wish Dask Arrays had sparse mode where you simply don't even populate missing blocks. In odc-stack we know which blocks are just empties, which have just one item and which require fusing several, I think time spent on figuring that out upfront worth it, especially when constructing large mosaics.

@rbavery
Copy link
Author

rbavery commented Apr 20, 2022

Thanks so much for all this context and comparison @matthewhanson @Kirill888 @gjoseph92 . I'm super impressed with both libraries and am excited that so much progress has been made on the problem of munging both wide and tall datasets with xarray.

@idantene
Copy link

idantene commented Jun 6, 2023

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

Sorry for the bump, I'd like to revisit this specifically and ask for clarification (might be off-topic, but I wasn't sure where to bring this up).
The documentation for groupby kwarg does not really explain what it does behind the scene. I could only infer that it perhaps chooses some single pixel as a representative sample (similar maybe to stackstac.mosaic?), but it could just as well do some filtering, averaging, or other arithmetic, and I couldn't find the documentation for this.

Could you elaborate please on what does odc-stack does in this groupby (e.g. in groupby='solar_time') vs how would stackstac achieve the same end result?

Thanks!

EDIT: Upon further inspection I found https://github.com/opendatacube/odc-stac/blob/52a016be2115f180e7059f67bcc6106fbeba7e8d/odc/stac/_load.py#LL701C9-L701C57, which suggests that the first pixel in each day is chosen. Is that indeed the case? Is there additional filtering behind the scenes?

@Kirill888
Copy link
Member

The documentation for groupby kwarg does not really explain what it does behind the scene.

true. Original use case was stitching "true tiles"...

There are really two independent dimensions to groupby

  1. Deciding what stac items should be stitched together into one raster plane
  2. Exactly how this "stitching" should happen
    • merging method
    • precedence of data

Right now merging method is always "first observed valid pixel", with precedence for defining "first" being either

  • Sort by time and id (default)
  • Or "original order" preserve_original_order=True option

My understanding of stackstac.mosaic approach is to basically let user choose what is needed and leverage xarray rich ecosystem of data handling options. The cost of that approach is a significantly increased Dask graph size for "wide" use-case as there are a lot of "nothing to do here" tasks for which there is no proper Dask support (can't tell which chunk is "empty", each one needs to "run" before it can be used by the down-stream code). You can absolutely use the same approach with odc-stac, omit groupby parameter and use xarray.groupby to achieve whatever custom merging logic you want.

By the way PRs for documentation enhancements are welcome :).

@idantene
Copy link

idantene commented Jun 7, 2023

Thanks @Kirill888, that helps a lot! I will hopefully make some PR for documentation once I'm done with our implementation of odc-stack (and stackstac).

If I understand correctly, to mimic groupby='solar_day' in stackstac, one needs to do something like stack.groupby(stack.time.astype('datetime64[D]')).first() and "hope" that the first pixel is indeed valid?
How does odc-stack define a "valid pixel" in "first observed valid pixel"?

@Kirill888
Copy link
Member

If I understand correctly, to mimic groupby='solar_day' in stackstac, one needs to do something like stack.groupby(stack.time.astype('datetime64[D]')).first() and "hope" that the first pixel is indeed valid?

@idantene it's a bit more complicated than that:

  1. time is in UTC so just looking at day component of the timestamp is not enough. We adjust timestamp based on the mean longitude to correct for UTC.
  2. .first() won't deal with missing values, and so will only work with pure tiles without missing pixels.

How does odc-stack define a "valid pixel" in "first observed valid pixel"?

Pixel is valid if it's a finite number (not nan or inf for floating point data) and not equal to nodata attribute if that is set on the source imagery.

@clausmichele
Copy link

Thanks for this nice recap of the differences between the projects. On my side not having all the metadata parsed by odc-stac is what stops me in adopting it compared to stackstac. It is so useful to understand the data content and filtering out what's unnecessary.

@maawoo
Copy link

maawoo commented Mar 18, 2024

I think this recent discussion from the Pangeo Discourse forum should be linked here as well: https://discourse.pangeo.io/t/comparing-odc-stac-load-and-stackstac-for-raster-composite-workflow/4097/13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants