documentation updates
jdries committed Oct 11, 2023
1 parent 9f69fe3 commit 7ebaff2
Showing 2 changed files with 59 additions and 7 deletions.
25 changes: 20 additions & 5 deletions docs/cookbook/sampling.rst
Dataset sampling
----------------

EXPERIMENTAL

Tested on:

- Terrascope
- Copernicus Dataspace Ecosystem

A number of use cases do not require a full datacube to be computed,
but only need results at specific locations.
Examples include extracting training data for model calibration, or computing results for
areas where validation data is available.

Sampling can be done for points or polygons:

- Point extractions result in a vector cube, which can be exported to tabular formats.
- Polygon extractions can be stored as an individual netCDF per polygon, so in this case the output is a sparse raster cube.

To indicate to openEO that we only want to compute the datacube for certain polygon features, we use the
:func:`~openeo.rest.datacube.DataCube.filter_spatial` method.
Combining all of this results in the following sample code.
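The snippet below is a minimal, illustrative sketch of such a workflow; the backend URL, collection id,
band names and the ``sample_by_feature`` output option are assumptions that may differ per backend::

    import json

    import openeo

    connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

    cube = connection.load_collection(
        "SENTINEL2_L2A",
        temporal_extent=["2023-05-01", "2023-06-01"],
        bands=["B04", "B08"],
    )

    # Only compute the datacube for the sampling features (a GeoJSON FeatureCollection)
    with open("sampling_polygons.geojson") as f:
        geometries = json.load(f)
    sampled = cube.filter_spatial(geometries)

    # Sampling requires a batch job; with netCDF output, one file per feature can be produced
    job = sampled.execute_batch(out_format="netCDF", sample_by_feature=True)
    job.get_results().download_files("output")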

Sampling only works for batch jobs, because it results in multiple output files, which cannot be conveniently transferred
in a synchronous call.

Performance & scalability
~~~~~~~~~~~~~~~~~~~~~~~~~

It is important to note that dataset sampling is not necessarily a cheap operation: creating a sparse datacube
may still require accessing a large number of raw EO assets. Backends can and should optimize to restrict processing
to a minimum, but the size of the required input datasets, rather than the size of the output dataset, is often the
determining factor for cost and performance.

Sampling at scale
~~~~~~~~~~~~~~~~~

When doing large-scale (e.g. continental) sampling, it is usually not possible, or at least impractical, to run it as a single openEO
batch job. The recommendation is to apply a spatial grouping to your sampling locations, with a single group covering
an area of around 100x100 km. The optimal size of a group may be backend-dependent. Also remember that when working with
data in a UTM projection, you may want to avoid covering multiple UTM zones in a single group.
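As an illustration, the snippet below sketches one way to bin sampling features into roughly 100x100 km groups
before submitting one batch job per group; the use of geopandas and the equal-area EPSG:3035 grid are assumptions,
not something prescribed by openEO::

    import geopandas as gpd

    features = gpd.read_file("sampling_polygons.geojson")

    # Use an equal-area projection so that a 100 km tiling is meaningful (EPSG:3035 covers Europe)
    features = features.to_crs(epsg=3035)
    tile_size = 100_000  # metres

    centroids = features.geometry.centroid
    features["group"] = (
        (centroids.x // tile_size).astype(int).astype(str)
        + "_"
        + (centroids.y // tile_size).astype(int).astype(str)
    )

    for group_id, group in features.groupby("group"):
        # Submit one openEO batch job per group, as in the sample code above
        print(group_id, len(group))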
41 changes: 39 additions & 2 deletions docs/udf.rst
Example: ``reduce_dimension`` with a UDF
========================================

The key point for a UDF invoked in the context of ``reduce_dimension`` is that it should return
an xarray ``DataArray`` *without* the dimension that is being reduced.

So a reduce over time would receive a ``DataArray`` with ``bands,t,y,x`` dimensions, and return one with only ``bands,y,x``.
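A minimal sketch of such a UDF, here reducing the time dimension with a simple temporal mean (the choice of
reducer is just an illustration):

.. code-block:: python

    from openeo.udf import XarrayDataCube


    def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
        array = cube.get_array()  # dimensions: bands, t, y, x
        # Drop the time dimension by taking the temporal mean,
        # so the result only has bands, y, x
        reduced = array.mean(dim="t")
        return XarrayDataCube(reduced)

On the client side, such a UDF can then be passed as the reducer, for instance
``cube.reduce_dimension(dimension="t", reducer=openeo.UDF.from_file("udf-code.py"))``.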


Example: ``apply_neighborhood`` with a UDF
===========================================

The ``apply_neighborhood`` process is generally used when working with complex AI models that require a
spatiotemporal input stack of a fixed size. It supports specifying an overlap, to ensure that the model
has sufficient border information to generate a spatially coherent output across chunks of the raster data cube.

In the example below, the UDF will receive chunks of 128x128 pixels: 112 is the chunk size, and 8 pixels of
overlap on each side of the chunk (2 x 8 = 16) bring the total to 128.

The time and band dimensions are not specified, which means that all values along these dimensions are included
in the chunks passed to the UDF.


.. code-block:: python

    output_cube = inputs_cube.apply_neighborhood(
        my_udf,
        size=[
            {"dimension": "x", "value": 112, "unit": "px"},
            {"dimension": "y", "value": 112, "unit": "px"},
        ],
        overlap=[
            {"dimension": "x", "value": 8, "unit": "px"},
            {"dimension": "y", "value": 8, "unit": "px"},
        ],
    )

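The UDF referenced as ``my_udf`` is not shown in the original example; a minimal sketch of what it could look
like is given below, where each invocation receives a 128x128 spatial chunk with all time steps and bands:

.. code-block:: python

    from openeo.udf import XarrayDataCube


    def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
        array = cube.get_array()  # spatial chunk of 128x128 pixels, all times and bands
        # Placeholder computation; a real model would run its inference here,
        # returning an array with the same spatial extent.
        result = array * 1.0
        return XarrayDataCube(result)
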
Example: Smoothing timeseries with a user defined function (UDF)
==================================================================

Note: this algorithm's primary purpose is to aid client-side development of UDFs using small datasets. It is not designed for large jobs.

UDF dependency management
=========================

Most UDFs have dependencies, because they are often used to run complex algorithms. Typical dependencies like numpy and
xarray can be assumed to be available, but others may be specific to your use case.

This is probably the least standardized part of the UDF specification, and may be backend-specific.
We include some general pointers here:

- Python dependencies can be packaged fairly easily by zipping a Python virtual environment.
- For some dependencies, it is important that the Python major version of the virtual environment matches the one used by the backend.
- Python allows you to dynamically append (or prepend) libraries to the search path: ``sys.path.append("unzipped_virtualenv_location")``
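As an illustration, a UDF could make an unpacked environment importable as sketched below; the exact path is an
assumption and depends on where the backend extracts your archive:

.. code-block:: python

    import sys

    # Prepend the unpacked environment so its packages take precedence over
    # backend-provided versions; the path below is hypothetical.
    sys.path.insert(0, "unzipped_virtualenv_location/lib/python3.9/site-packages")

    # After this, packages shipped in the archive can be imported as usual,
    # e.g. ``import my_custom_package`` (a hypothetical dependency).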



Profile a process server-side
==============================