documentation updates
jdries committed Oct 11, 2023
1 parent 9f69fe3 commit 7ebaff2
Showing 2 changed files with 59 additions and 7 deletions.
25 changes: 20 additions & 5 deletions docs/cookbook/sampling.rst
Dataset sampling
----------------

EXPERIMENTAL

Tested on:

- Terrascope
- Copernicus Dataspace Ecosystem

A number of use cases do not require a full datacube to be computed,
but only need results at specific locations.
Examples include extracting training data for model calibration, or computing results for
areas where validation data is available.

Sampling can be done for points or polygons:

- Point extractions result in a vector cube, which can be exported to tabular formats.
- Polygon extractions can be stored as an individual netCDF per polygon, so in this case the output is a sparse raster cube.

To indicate to openEO that we only want to compute the datacube for certain polygon features, we use the
:func:`~openeo.rest.datacube.DataCube.filter_spatial` method.
Combining all of this results in the following sample code.
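The snippet below is a minimal, illustrative sketch of such a workflow; the backend URL, collection id,
band names and the ``sample_by_feature`` output option are assumptions that may differ per backend::

    import json

    import openeo

    connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

    cube = connection.load_collection(
        "SENTINEL2_L2A",
        temporal_extent=["2023-05-01", "2023-06-01"],
        bands=["B04", "B08"],
    )

    # Only compute the datacube for the sampling features (a GeoJSON FeatureCollection)
    with open("sampling_polygons.geojson") as f:
        geometries = json.load(f)
    sampled = cube.filter_spatial(geometries)

    # Sampling requires a batch job; with netCDF output, one file per feature can be produced
    job = sampled.execute_batch(out_format="netCDF", sample_by_feature=True)
    job.get_results().download_files("output")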

Sampling only works for batch jobs, because it results in multiple output files, which cannot be conveniently transferred
in a synchronous call.

Performance & scalability
~~~~~~~~~~~~~~~~~~~~~~~~~

It is important to note that dataset sampling is not necessarily a cheap operation: creating a sparse datacube
may still require accessing a large number of raw EO assets. Backends can and should optimize to restrict processing
to a minimum, but the size of the required input datasets, rather than the size of the output dataset, is often the
determining factor for cost and performance.

Sampling at scale
~~~~~~~~~~~~~~~~~

When doing large-scale (e.g. continental) sampling, it is usually not possible, or at least impractical, to run it as a single openEO
batch job. The recommendation is to apply a spatial grouping to your sampling locations, with a single group covering
an area of around 100x100 km. The optimal size of a group may be backend-dependent. Also remember that when working with
data in a UTM projection, you may want to avoid covering multiple UTM zones in a single group.
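As an illustration, the snippet below sketches one way to bin sampling features into roughly 100x100 km groups
before submitting one batch job per group; the use of geopandas and the equal-area EPSG:3035 grid are assumptions,
not something prescribed by openEO::

    import geopandas as gpd

    features = gpd.read_file("sampling_polygons.geojson")

    # Use an equal-area projection so that a 100 km tiling is meaningful (EPSG:3035 covers Europe)
    features = features.to_crs(epsg=3035)
    tile_size = 100_000  # metres

    centroids = features.geometry.centroid
    features["group"] = (
        (centroids.x // tile_size).astype(int).astype(str)
        + "_"
        + (centroids.y // tile_size).astype(int).astype(str)
    )

    for group_id, group in features.groupby("group"):
        # Submit one openEO batch job per group, as in the sample code above
        print(group_id, len(group))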
41 changes: 39 additions & 2 deletions docs/udf.rst
Example: ``reduce_dimension`` with a UDF
========================================

The key point for a UDF invoked in the context of ``reduce_dimension`` is that it should return
an xarray ``DataArray`` *without* the dimension that is being reduced.

So a reduce over time would receive a ``DataArray`` with ``bands,t,y,x`` dimensions, and return one with only ``bands,y,x``.
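A minimal sketch of such a UDF, here reducing the time dimension with a simple temporal mean (the choice of
reducer is just an illustration):

.. code-block:: python

    from openeo.udf import XarrayDataCube


    def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
        array = cube.get_array()  # dimensions: bands, t, y, x
        # Drop the time dimension by taking the temporal mean,
        # so the result only has bands, y, x
        reduced = array.mean(dim="t")
        return XarrayDataCube(reduced)

On the client side, such a UDF can then be passed as the reducer, for instance
``cube.reduce_dimension(dimension="t", reducer=openeo.UDF.from_file("udf-code.py"))``.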


Example: ``apply_neighborhood`` with a UDF
===========================================

The ``apply_neighborhood`` process is generally used when working with complex AI models that require a
spatiotemporal input stack of a fixed size. It supports specifying an overlap, to ensure that the model
has sufficient border information to generate a spatially coherent output across chunks of the raster data cube.

In the example below, the UDF will receive chunks of 128x128 pixels: 112 is the chunk size, and 8 pixels of
overlap on each side of the chunk (2 x 8 = 16) bring the total to 128.

The time and band dimensions are not specified, which means that all values along these dimensions are included
in the chunks passed to the UDF.


.. code-block:: python

    output_cube = inputs_cube.apply_neighborhood(
        my_udf,
        size=[
            {"dimension": "x", "value": 112, "unit": "px"},
            {"dimension": "y", "value": 112, "unit": "px"},
        ],
        overlap=[
            {"dimension": "x", "value": 8, "unit": "px"},
            {"dimension": "y", "value": 8, "unit": "px"},
        ],
    )

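The UDF referenced as ``my_udf`` is not shown in the original example; a minimal sketch of what it could look
like is given below, where each invocation receives a 128x128 spatial chunk with all time steps and bands:

.. code-block:: python

    from openeo.udf import XarrayDataCube


    def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
        array = cube.get_array()  # spatial chunk of 128x128 pixels, all times and bands
        # Placeholder computation; a real model would run its inference here,
        # returning an array with the same spatial extent.
        result = array * 1.0
        return XarrayDataCube(result)
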
Example: Smoothing timeseries with a user defined function (UDF)
==================================================================

Note: this algorithm's primary purpose is to aid client-side development of UDFs using small datasets. It is not designed for large jobs.

UDF dependency management
=========================

Most UDFs have dependencies, because they are often used to run complex algorithms. Typical dependencies like numpy and
xarray can be assumed to be available, but others may be specific to your use case.

This is probably the least standardized part of the UDF specification, and may be backend-specific.
We include some general pointers here:

- Python dependencies can be packaged fairly easily by zipping a Python virtual environment.
- For some dependencies, it is important that the Python major version of the virtual environment matches the one used by the backend.
- Python allows you to dynamically append (or prepend) libraries to the search path: ``sys.path.append("unzipped_virtualenv_location")``
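As an illustration, a UDF could make an unpacked environment importable as sketched below; the exact path is an
assumption and depends on where the backend extracts your archive:

.. code-block:: python

    import sys

    # Prepend the unpacked environment so its packages take precedence over
    # backend-provided versions; the path below is hypothetical.
    sys.path.insert(0, "unzipped_virtualenv_location/lib/python3.9/site-packages")

    # After this, packages shipped in the archive can be imported as usual,
    # e.g. ``import my_custom_package`` (a hypothetical dependency).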



Profile a process server-side
==============================