42 documentation #164

Draft · wants to merge 8 commits into base: main
3 changes: 2 additions & 1 deletion .gitignore
Expand Up @@ -162,4 +162,5 @@ cython_debug/

tests/test_openeo_gfmap/results/
notebook/
data/
data/
docs/
28 changes: 28 additions & 0 deletions documentation/_quarto.yaml
@@ -0,0 +1,28 @@
project:
type: website
output-dir: ../docs
resources:
- "data/**"
website:
sidebar:
style: "floating"
collapse-level: 2
contents:
- section: "Home"
contents:
- text: "Introduction"
href: introduction.qmd
- section: "Sections"
contents:
- text: "Data Fetching"
href: fetching.qmd
- text: "Preprocessing"
href: preprocessing.qmd
- text: "Features"
href: features.qmd
- text: "Inference"
href: inference.qmd
- text: "Managers"
href: managers.qmd
- text: "Utils"
href: utilities.qmd
121 changes: 121 additions & 0 deletions documentation/features.qmd
@@ -0,0 +1,121 @@
---
title: "Features extraction"
---

In mapping pipelines, feature extraction typically occurs after the pre-processing of data fetched from the sources. Feature extraction creates new features from source data and reduces time series signals into specific features. For example, Sentinel-2 time series can be derived into spectral indices such as NDVI, as well as phenology metrics such as Start of Season or Peak of Season.
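As a concrete illustration, a spectral index such as NDVI is a simple per-pixel band combination. The sketch below is a generic NumPy example, independent of the GFMAP API:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Toy reflectance values for two pixels
nir = np.array([0.6, 0.7])
red = np.array([0.1, 0.2])

# Values close to 1 indicate dense, healthy vegetation
print(ndvi(nir, red))
```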

Additionally, reducing time series data into features is an ideal task for foundation models. The flagship project of GFMAP, <i>WorldCereal</i>, uses [Presto](https://github.com/nasaharvest/presto) to perform data fusion of multiple data sources and return vectors of 128 features.

Feature extractors are therefore Python classes that are transformed into a UDF, turning a cube of dimensions `(bands, t, x, y)` into `(features, x, y)`. Essentially, this is a transformation from a 4D cube to a 3D cube. To implement your own feature extractor, create a class derived from `openeo_gfmap.features.feature_extractor.PatchFeatureExtractor`.
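The shape of this transformation can be sketched with plain NumPy (hypothetical dimension sizes, not the GFMAP API):

```python
import numpy as np

# A toy (bands, t, y, x) cube: 3 bands, 5 timesteps, 4x4 pixels
cube = np.random.rand(3, 5, 4, 4)

# Reducing the time dimension (axis=1) collapses the 4D cube
# into a 3D (features, y, x) cube
features = np.median(cube, axis=1)
print(features.shape)  # (3, 4, 4)
```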

### Example of feature extractor

Let's imagine a feature extractor that performs spatial smoothing and computes the median of the signal for the red, green and blue bands, as well as the latitude and longitude values.


```python
from typing import List

import xarray as xr

from openeo_gfmap.features.feature_extractor import PatchFeatureExtractor

class MyExtractor(PatchFeatureExtractor):

    def output_labels(self) -> List[str]:
        if self._parameters.get("extract_latlons", False):
            return ["red", "green", "blue", "lat", "lon"]
        else:
            return ["red", "green", "blue"]

    def execute(self, inarr: xr.DataArray):
        import xarray as xr
        from scipy.ndimage import gaussian_filter

        # Perform some gaussian filtering to blur the RGB bands
        rgb_bands = inarr.sel(bands=["S2-L2A-B04", "S2-L2A-B03", "S2-L2A-B02"])

        for band in rgb_bands.bands:
            for timestamp in rgb_bands.t:
                rgb_bands.loc[{"bands": band, "t": timestamp}] = gaussian_filter(
                    rgb_bands.loc[{"bands": band, "t": timestamp}], sigma=1.0
                )

        # Compute the median over the time dimension
        rgb_bands = rgb_bands.median(dim="t").assign_coords(
            {"bands": ["red", "green", "blue"]}
        )

        if self._parameters.get("extract_latlons", False):
            # Compute the latlons, implemented in the superclass PatchFeatureExtractor
            latlons = self.get_latlons(inarr)
            return xr.concat([rgb_bands, latlons], dim="bands")

        return rgb_bands
```

* The `output_labels` function returns the labels of the bands expected to be returned by the feature extractor. It can also read custom parameters for the extractor through `self._parameters.get(...)`, allowing the expected output labels to be evaluated dynamically.

* The `execute` function is the entry point for user code and is expected to return a 3D `xarray.DataArray` with dimension names `(bands, y, x)`. In this output cube, the `bands` dimension actually holds what we consider `features`, but it must be named `bands` for OpenEO compatibility.

* Imports of libraries should be placed inside the individual functions or inside the class. The class source code is read and embedded in a generated UDF, but code outside the class is not exported.

* In this example, the `MyExtractor.execute` function computes the latitudes and longitudes with `PatchFeatureExtractor.get_latlons`, a helper implemented in the superclass.


More advanced Feature Extractors can be found in the [WorldCereal Repository](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/feature_extractor.py).


Here is an example of executing this feature extraction on a patch:

```python
# We first fetch the data from the sources.
from openeo_gfmap import BoundingBoxExtent, TemporalContext
from openeo_gfmap.backend import Backend, BackendContext, cdse_connection
from openeo_gfmap.fetching import (
    FetchType,
    build_sentinel2_l2a_extractor,
)

SPATIAL_EXTENT = BoundingBoxExtent(
    west=5.05,
    south=51.21,
    east=5.06,
    north=51.22,
    epsg=4326,
)

# Recent dates for the extent
TEMPORAL_CONTEXT = TemporalContext(start_date="2023-04-01", end_date="2023-05-01")

# Bands to extract
bands = [
    "S2-L2A-B04",
    "S2-L2A-B03",
    "S2-L2A-B02",
]

context = BackendContext(Backend.CDSE)

# Creating an OpenEO client connection
connection = cdse_connection()

fetcher = build_sentinel2_l2a_extractor(
    backend_context=context, bands=bands, fetch_type=FetchType.TILE
)

cube = fetcher.get_cube(connection, SPATIAL_EXTENT, TEMPORAL_CONTEXT)

# We then run the extractor
from openeo_gfmap.features.feature_extractor import apply_feature_extractor

# Parameters, can be dynamically changed by user inputs...
parameters = {
    "extract_latlons": True
}

features = apply_feature_extractor(MyExtractor, cube, parameters)

# Start the job asynchronously.
```

<i>Note:</i> you can test your Feature Extractor locally using `openeo_gfmap.features.feature_extractor.apply_feature_extractor_local`.
204 changes: 204 additions & 0 deletions documentation/fetching.qmd
@@ -0,0 +1,204 @@
---
title: "Data Fetching"
---

### Supported collections

The submodule `openeo_gfmap.fetching` provides ways to access data catalogues and perform source-specific preprocessing for multiple collections.

So far, the following collections are well supported, providing standardized band mappings and preprocessing routines:

* Sentinel 2 L2A
* Sentinel 1 L1C
* Copernicus 30 DEM
* AGERA5

For example, the output band names for Sentinel2 are:

```python
from openeo_gfmap.fetching.s2 import BASE_SENTINEL2_L2A_MAPPING

BASE_SENTINEL2_L2A_MAPPING
```

### Additional collections

Using the generic extractor in `openeo_gfmap.fetching.generic`, you can also load other collections available in OpenEO backends, or external collections through a STAC catalogue. Those collections' band names will not be mapped to standardized names, but instead kept as provided in the collection or in the STAC catalogue.

Ideally, GFMAP should prioritize support for widely used collections. If an unsupported collection is frequently used, then GFMAP should work to integrate it.

### Different fetch types

GFMAP distinguishes different fetch types, categorized under the `openeo_gfmap.FetchType` enum:

* Tile-based data, which consists of raster data over a continuous area.
* Polygon-based data, which is also raster data, but with multiple patches scattered around a region.
* Point-based data, where the output is a `VectorCube` and results are saved in a DataFrame.

```python
from openeo_gfmap import FetchType

[value for value in FetchType]
```

Whenever using a collection fetcher, the user is required to specify which fetch type will be used.

### Fetcher parameters

Whenever constructing a collection fetcher, additional parameters can be provided in the form of a Python dictionary.

Parameters related to collection loading are provided in a subdictionary under the `load_collection` key.

| Parameter name | Description | Supported Collections | Example |
|:--------------:|:------------|:----------------------|:--------|
|<i>tileID</i> | Filters on a specific MGRS tile. | Sentinel2 L2A | "20LMR" |
|<i>eo:cloud_cover</i> | Filters on estimated cloud cover, between 0 and 100 | Sentinel2 L2A | `lambda val: val <= 95.0` |
|<i>sat:orbit_state</i> | Filters on satellite orbit direction, either `ASCENDING` or `DESCENDING` | Sentinel1 L1C | `lambda val: val == "ASCENDING"` |
|<i>polarisation</i> | Filters on the polarisations present in the product | Sentinel1 L1C | `lambda val: val == "VV&VH"` |

Other collection parameters exist and depend on the collection being fetched. Refer to the [OpenEO Python Client API](https://open-eo.github.io/openeo-python-client/api.html#openeo.rest.datacube.DataCube.load_collection) and the [collection properties](https://hub.openeo.org/) for more parameters.

In addition to collection parameters, other parameters are available that relate to the source-specific preprocessing operations.

| Parameter name | Description | Supported Collections | Example |
|:--------------:|:------------|:----------------------|:--------|
|<i>tilesize</i> | Size of the tile to load at a time. | <i>All</i> | {"update_arguments": {"feature_flag": {"tilesize": 1}}} |
|<i>target_resolution</i> | Resolution on which to sample the collection, expressed in meters. | <i>All</i> | 10 |
|<i>resampling_method</i> | Method on which to resample. | <i>All</i> | "near" |
|<i>target_crs</i> | CRS code - in EPSG - on which to warp the fetched data. | <i>All</i> | 32631 |
|<i>elevation_model</i> | Elevation model to use to compute the Backscatter. | <i>Sentinel 1</i> | "COPERNICUS_30" |
|<i>coefficient</i> | Backscatter coefficient to compute. | <i>Sentinel 1</i> | "sigma0-ellipsoid" |
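Combining both tables, a parameters dictionary might look as follows (hypothetical values; the `load_collection` key holds the collection filters, the remaining keys drive the preprocessing):

```python
fetching_parameters = {
    "target_crs": 32631,          # warp to UTM zone 31N
    "target_resolution": 10,      # resample to 10 m
    "resampling_method": "near",
    "load_collection": {
        # keep only products with at most 80% estimated cloud cover
        "eo:cloud_cover": lambda val: val <= 80.0,
    },
}
```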

### Example: extracting a tile from Sentinel2

Extracting a continuous tile from the Sentinel 2 collection.

```python

from openeo_gfmap import BoundingBoxExtent, SpatialContext, TemporalContext
from openeo_gfmap.backend import BACKEND_CONNECTIONS, Backend, BackendContext, cdse_connection
from openeo_gfmap.fetching import (
FetchType,
build_sentinel2_l2a_extractor,
)

SPATIAL_EXTENT = BoundingBoxExtent(
west=5.0515130512706845,
south=51.215806593713,
east=5.060320484557499,
north=51.22149744530769,
epsg=4326,
)

# Recent dates for the extent
TEMPORAL_CONTEXT = TemporalContext(start_date="2023-04-01", end_date="2023-05-01")

# Bands to extract
bands = [
"S2-L2A-B01",
"S2-L2A-B04",
"S2-L2A-B08",
"S2-L2A-B11",
"S2-L2A-SCL",
"S2-L2A-AOT",
]

context = BackendContext(Backend.CDSE)

# Creating an OpenEO client connection
connection = cdse_connection()

fetcher = build_sentinel2_l2a_extractor(
backend_context=context, bands=bands, fetch_type=FetchType.TILE
)

cube = fetcher.get_cube(connection, SPATIAL_EXTENT, TEMPORAL_CONTEXT)

# Create a job asynchronously or run it synchronously.
```

### Example: extracting sparse polygons from Sentinel2

Extracting sparse polygons from a large extent in Sentinel2; note the parameters used in this case.

```python
from pathlib import Path

import geopandas as gpd
import geojson

POINTS_EXTRACTIONS_DF = Path(__file__).parent.parent / "tests/test_openeo_gfmap/resources/puglia_extractions_polygons.gpkg"

# Converting the file to a Lat/Lon GeoJSON format
extraction_df = gpd.read_file(POINTS_EXTRACTIONS_DF)
geojson_features = extraction_df.geometry.__geo_interface__
spatial_context = geojson.GeoJSON(
    {"type": "FeatureCollection", "features": geojson_features["features"]}
)

# Alternatively, you can submit a GeoParquet file in an artifactory bucket and use it from a public
# URL. This will allow you to specify polygons in a specific UTM projection.

# Define the temporal context
temporal_context = TemporalContext(
start_date=extraction_df.iloc[0]["start_date"],
end_date=extraction_df.iloc[0]["end_date"],
)

fetching_parameters = {
"target_crs": 3035,
"load_collection": {
"eo:cloud_cover": lambda val: val <= 80.0 # Max 80% cloud cover per product
}
}

fetcher = build_sentinel2_l2a_extractor(
backend_context=context,
bands=bands,
fetch_type=FetchType.POLYGON,
**fetching_parameters,
)

cube = fetcher.get_cube(connection, spatial_context, temporal_context)

# Create a job asynchronously or run it synchronously.
```

### Example: extracting a dataset of points from Sentinel2

Extracting a dataset of points is similar; however, you will be required to manually add an aggregation operation after getting the cube from the fetcher.

Whenever extracting points, processing is done as if it were a set of small polygons, allowing the same OpenEO processes and workflow to be used as when processing tiles or sparse polygons. To get points, you then need to run an aggregation operation, for example: `cube.aggregate_spatial(spatial_context, reducer="mean")`.

```python

# Dataset of polygons for POINT based extraction
POINT_EXTRACTION_DF = (
Path(__file__).parent.parent / "tests/test_openeo_gfmap/resources/malawi_extraction_polygons.gpkg"
)

extraction_df = gpd.read_file(POINT_EXTRACTION_DF)

# Convert GeoDataFrame to feature collection to build spatial context
geojson_features = extraction_df.geometry.__geo_interface__
spatial_context = geojson.GeoJSON(
{"type": "FeatureCollection", "features": geojson_features["features"]}
)

# Build the temporal context
temporal_context = TemporalContext(
start_date=extraction_df.iloc[0]["start_date"],
end_date=extraction_df.iloc[0]["end_date"],
)

extractor = build_sentinel2_l2a_extractor(
backend_context=context,
bands=bands,
fetch_type=FetchType.POINT,
)

cube = extractor.get_cube(connection, spatial_context, temporal_context)

cube = cube.aggregate_spatial(spatial_context, reducer="mean")
```
15 changes: 15 additions & 0 deletions documentation/inference.qmd
@@ -0,0 +1,15 @@
---
title: "Model Inference"
---

Once features are computed, model predictions (classification, regression) are performed in the Model Inference class.

While similar in logic to how the [Feature Extractor](features.qmd) works, Model Inference doesn't reduce 4 dimensions into 3, but instead transforms the `features` dimension into `predictions`.

Essentially, a Model Inference class will load a model, run inference and write probabilities. It can also be used to perform post-processing on predictions, as done in [WorldCereal](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/postprocess.py).
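Conceptually, inference maps the `features` dimension to per-pixel prediction scores. The following is a toy NumPy sketch with a hypothetical linear model (illustration only, not the GFMAP API):

```python
import numpy as np

# Toy (features, y, x) cube: 128 features over a 4x4 patch
n_features, height, width = 128, 4, 4
features = np.random.rand(n_features, height, width)

# Hypothetical linear model producing 2 class scores from 128 features
weights = np.random.rand(2, n_features)

# einsum applies the model per pixel: (classes, features) x (features, y, x)
scores = np.einsum("cf,fyx->cyx", weights, features)

# Softmax over the class axis turns scores into pseudo-probabilities
probabilities = np.exp(scores) / np.exp(scores).sum(axis=0)
print(probabilities.shape)  # (2, 4, 4)
```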

### Running a model inference

An example of a `ModelInference` performing predictions can be found [here](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/inference.py#L86). As when implementing a `FeatureExtractor`, at minimum two functions need to be implemented: `output_labels` and `execute`. The `execute` function is expected to return a 3D cube with the dimensions `(bands, x, y)`.

To run a model inference class, use the function `openeo_gfmap.inference.model_inference.apply_model_inference`.