42 documentation #164

Draft · wants to merge 8 commits into base: main
3 changes: 2 additions & 1 deletion .gitignore
Expand Up @@ -162,4 +162,5 @@ cython_debug/

tests/test_openeo_gfmap/results/
notebook/
data/
data/
docs/
28 changes: 28 additions & 0 deletions documentation/_quarto.yaml
@@ -0,0 +1,28 @@
project:
type: website
output-dir: ../docs
resources:
- "data/**"
website:
sidebar:
style: "floating"
collapse-level: 2
contents:
- section: "Home"
contents:
- text: "Introduction"
href: introduction.qmd
- section: "Sections"
contents:
- text: "Data Fetching"
href: fetching.qmd
- text: "Preprocessing"
href: preprocessing.qmd
- text: "Features"
href: features.qmd
- text: "Inference"
href: inference.qmd
- text: "Managers"
href: managers.qmd
- text: "Utils"
href: utilities.qmd
121 changes: 121 additions & 0 deletions documentation/features.qmd
@@ -0,0 +1,121 @@
---
title: "Features extraction"
---

In mapping pipelines, feature extraction typically occurs after the pre-processing of data fetched from the sources. Feature extraction creates new features from source data and reduces time series signals into specific features. For example, Sentinel-2 time series can be derived into spectral indices such as NDVI, as well as phenology metrics such as Start of Season or Peak of Season.
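As a concrete illustration, a spectral index such as NDVI is a simple per-pixel band combination. The sketch below is a generic NumPy example, independent of the GFMAP API:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Toy reflectance values for two pixels
nir = np.array([0.6, 0.7])
red = np.array([0.1, 0.2])

# Values close to 1 indicate dense, healthy vegetation
print(ndvi(nir, red))
```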

Additionally, reducing time series data into features is an ideal task for foundation models. The flagship project of GFMAP, <i>WorldCereal</i>, uses [Presto](https://github.com/nasaharvest/presto) to perform data fusion of multiple data sources and return vectors of 128 features.

Feature extractors are therefore Python classes that are transformed into a UDF, turning a cube of dimensions `(bands, t, x, y)` into `(features, x, y)`. Essentially, this is a transformation from a 4D cube to a 3D cube. To implement your own feature extractor, create a class derived from `openeo_gfmap.features.feature_extractor.PatchFeatureExtractor`.
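The shape of this transformation can be sketched with plain NumPy (hypothetical dimension sizes, not the GFMAP API):

```python
import numpy as np

# A toy (bands, t, y, x) cube: 3 bands, 5 timesteps, 4x4 pixels
cube = np.random.rand(3, 5, 4, 4)

# Reducing the time dimension (axis=1) collapses the 4D cube
# into a 3D (features, y, x) cube
features = np.median(cube, axis=1)
print(features.shape)  # (3, 4, 4)
```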

### Example of feature extractor

Let's imagine a feature extractor that performs spatial smoothing and computes the median of the signal for the red, green and blue bands, as well as the latitude and longitude values.


```python
from typing import List

import xarray as xr

from openeo_gfmap.features.feature_extractor import PatchFeatureExtractor

class MyExtractor(PatchFeatureExtractor):

    def output_labels(self) -> List[str]:
        if self._parameters.get("extract_latlons", False):
            return ["red", "green", "blue", "lat", "lon"]
        else:
            return ["red", "green", "blue"]

    def execute(self, inarr: xr.DataArray):
        import xarray as xr
        from scipy.ndimage import gaussian_filter

        # Perform some gaussian filtering to blur the RGB bands
        rgb_bands = inarr.sel(bands=["S2-L2A-B04", "S2-L2A-B03", "S2-L2A-B02"])

        for band in rgb_bands.bands:
            for timestamp in rgb_bands.t:
                rgb_bands.loc[{"bands": band, "t": timestamp}] = gaussian_filter(
                    rgb_bands.loc[{"bands": band, "t": timestamp}], sigma=1.0
                )

        # Compute the median over the time dimension
        rgb_bands = rgb_bands.median(dim="t").assign_coords(
            {"bands": ["red", "green", "blue"]}
        )

        if self._parameters.get("extract_latlons", False):
            # Compute the latlons, implemented in the superclass PatchFeatureExtractor
            latlons = self.get_latlons(inarr)
            return xr.concat([rgb_bands, latlons], dim="bands")

        return rgb_bands
```

* The `output_labels` function returns the labels of the bands expected to be returned by the feature extractor. It can also read custom parameters for the extractor through `self._parameters.get(...)`, allowing the expected output labels to be evaluated dynamically.

* The `execute` function is the entry point for user code and is expected to return a 3D `xarray.DataArray` with dimension names `(bands, y, x)`. In this output cube, the `bands` dimension actually holds what we consider `features`, but it must be named `bands` for OpenEO compatibility.

* Imports of libraries should be placed inside the individual functions or inside the class. The class source code is read and embedded in a generated UDF, but code outside the class is not exported.

* In this example, the `MyExtractor.execute` function computes the latitudes and longitudes with `PatchFeatureExtractor.get_latlons`, a helper implemented in the superclass.


More advanced Feature Extractors can be found in the [WorldCereal Repository](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/feature_extractor.py).


Here is an example of executing this feature extraction on a patch:

```python
# We first fetch the data from the sources.
from openeo_gfmap import BoundingBoxExtent, TemporalContext
from openeo_gfmap.backend import Backend, BackendContext, cdse_connection
from openeo_gfmap.fetching import (
    FetchType,
    build_sentinel2_l2a_extractor,
)

SPATIAL_EXTENT = BoundingBoxExtent(
    west=5.05,
    south=51.21,
    east=5.06,
    north=51.22,
    epsg=4326,
)

# Recent dates for the extent
TEMPORAL_CONTEXT = TemporalContext(start_date="2023-04-01", end_date="2023-05-01")

# Bands to extract
bands = [
    "S2-L2A-B04",
    "S2-L2A-B03",
    "S2-L2A-B02",
]

context = BackendContext(Backend.CDSE)

# Creating an OpenEO client connection
connection = cdse_connection()

fetcher = build_sentinel2_l2a_extractor(
    backend_context=context, bands=bands, fetch_type=FetchType.TILE
)

cube = fetcher.get_cube(connection, SPATIAL_EXTENT, TEMPORAL_CONTEXT)

# We then run the extractor
from openeo_gfmap.features.feature_extractor import apply_feature_extractor

# Parameters, can be dynamically changed by user inputs...
parameters = {
    "extract_latlons": True
}

features = apply_feature_extractor(MyExtractor, cube, parameters)

# Start the job asynchronously.
```

<i>Note:</i> you can test your Feature Extractor locally using `openeo_gfmap.features.feature_extractor.apply_feature_extractor_local`.
204 changes: 204 additions & 0 deletions documentation/fetching.qmd
@@ -0,0 +1,204 @@
---
title: "Data Fetching"
---

### Supported collections

The submodule `openeo_gfmap.fetching` provides ways to access data catalogues and perform source-specific preprocessing for multiple collections.

So far, the following collections are well supported, providing standardized band mappings and preprocessing routines:

* Sentinel 2 L2A
* Sentinel 1 L1C
* Copernicus 30 DEM
* AGERA5

For example, the output band names for Sentinel2 are:

```python
from openeo_gfmap.fetching.s2 import BASE_SENTINEL2_L2A_MAPPING

BASE_SENTINEL2_L2A_MAPPING
```

### Additional collections

Using the generic extractor in `openeo_gfmap.fetching.generic`, you can also load other collections available in OpenEO backends, or external collections through a STAC catalogue. Those collections' band names will not be mapped to standardized names, but instead kept as provided in the collection or in the STAC catalogue.

Ideally, GFMAP should prioritize support for widely used collections. If an unsupported collection is frequently used, then GFMAP should work to integrate it.

### Different fetch types

GFMAP distinguishes different fetch types, categorized under the `openeo_gfmap.FetchType` enum:

* Tile-based data, which consists of raster data over a continuous area.
* Polygon-based data, which is also raster data, but with multiple patches scattered around a region.
* Point-based data, where the output is a `VectorCube` and results are saved in a DataFrame.

```python
from openeo_gfmap import FetchType

[value for value in FetchType]
```

Whenever using a collection fetcher, the user is required to specify which fetch type will be used.

### Fetcher parameters

Whenever constructing a collection fetcher, additional parameters can be provided in the form of a Python dictionary.

Parameters related to collection loading are provided in a subdictionary under the `load_collection` key.

| Parameter name | Description | Supported Collections | Example |
|:--------------:|:------------|:----------------------|:--------|
|<i>tileID</i> | Filters on a specific MGRS tile. | Sentinel2 L2A | "20LMR" |
|<i>eo:cloud_cover</i> | Filters on estimated cloud cover, between 0 and 100 | Sentinel2 L2A | `lambda val: val <= 95.0` |
|<i>sat:orbit_state</i> | Filters on satellite orbit direction, either `ASCENDING` or `DESCENDING` | Sentinel1 L1C | `lambda val: val == "ASCENDING"` |
|<i>polarisation</i> | Filters on the polarisations present in the product | Sentinel1 L1C | `lambda val: val == "VV&VH"` |

Other collection parameters exist and depend on the collection being fetched. Refer to the [OpenEO Python Client API](https://open-eo.github.io/openeo-python-client/api.html#openeo.rest.datacube.DataCube.load_collection) and the [collection properties](https://hub.openeo.org/) for more parameters.

In addition to collection parameters, other parameters are available that relate to the source-specific preprocessing operations.

| Parameter name | Description | Supported Collections | Example |
|:--------------:|:------------|:----------------------|:--------|
|<i>tilesize</i> | Size of the tile to load at a time. | <i>All</i> | {"update_arguments": {"feature_flag": {"tilesize": 1}}} |
|<i>target_resolution</i> | Resolution on which to sample the collection, expressed in meters. | <i>All</i> | 10 |
|<i>resampling_method</i> | Method on which to resample. | <i>All</i> | "near" |
|<i>target_crs</i> | CRS code - in EPSG - on which to warp the fetched data. | <i>All</i> | 32631 |
|<i>elevation_model</i> | Elevation model to use to compute the Backscatter. | <i>Sentinel 1</i> | "COPERNICUS_30" |
|<i>coefficient</i> | Backscatter coefficient to compute. | <i>Sentinel 1</i> | "sigma0-ellipsoid" |
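Combining both tables, a parameters dictionary might look as follows (hypothetical values; the `load_collection` key holds the collection filters, the remaining keys drive the preprocessing):

```python
fetching_parameters = {
    "target_crs": 32631,          # warp to UTM zone 31N
    "target_resolution": 10,      # resample to 10 m
    "resampling_method": "near",
    "load_collection": {
        # keep only products with at most 80% estimated cloud cover
        "eo:cloud_cover": lambda val: val <= 80.0,
    },
}
```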

### Example: extracting a tile from Sentinel2

Extracting a continuous tile from the Sentinel 2 collection.

```python

from openeo_gfmap import BoundingBoxExtent, SpatialContext, TemporalContext
from openeo_gfmap.backend import BACKEND_CONNECTIONS, Backend, BackendContext, cdse_connection
from openeo_gfmap.fetching import (
FetchType,
build_sentinel2_l2a_extractor,
)

SPATIAL_EXTENT = BoundingBoxExtent(
west=5.0515130512706845,
south=51.215806593713,
east=5.060320484557499,
north=51.22149744530769,
epsg=4326,
)

# Recent dates for the extent
TEMPORAL_CONTEXT = TemporalContext(start_date="2023-04-01", end_date="2023-05-01")

# Bands to extract
bands = [
"S2-L2A-B01",
"S2-L2A-B04",
"S2-L2A-B08",
"S2-L2A-B11",
"S2-L2A-SCL",
"S2-L2A-AOT",
]

context = BackendContext(Backend.CDSE)

# Creating an OpenEO client connection
connection = cdse_connection()

fetcher = build_sentinel2_l2a_extractor(
backend_context=context, bands=bands, fetch_type=FetchType.TILE
)

cube = fetcher.get_cube(connection, SPATIAL_EXTENT, TEMPORAL_CONTEXT)

# Create a job asynchronously or run it synchronously.
```

### Example: extracting sparse polygons from Sentinel2

Extracting sparse polygons from a large extent in Sentinel2; note the parameters used in this case.

```python
from pathlib import Path

import geopandas as gpd
import geojson

POINTS_EXTRACTIONS_DF = Path(__file__).parent.parent / "tests/test_openeo_gfmap/resources/puglia_extractions_polygons.gpkg"

# Converting the file to a Lat/Lon GeoJSON format
extraction_df = gpd.read_file(POINTS_EXTRACTIONS_DF)
geojson_features = extraction_df.geometry.__geo_interface__
spatial_context = geojson.GeoJSON(
    {"type": "FeatureCollection", "features": geojson_features["features"]}
)

# Alternatively, you can submit a GeoParquet file in an artifactory bucket and use it from a public
# URL. This will allow you to specify polygons in a specific UTM projection.

# Define the temporal context
temporal_context = TemporalContext(
start_date=extraction_df.iloc[0]["start_date"],
end_date=extraction_df.iloc[0]["end_date"],
)

fetching_parameters = {
"target_crs": 3035,
"load_collection": {
"eo:cloud_cover": lambda val: val <= 80.0 # Max 80% cloud cover per product
}
}

fetcher = build_sentinel2_l2a_extractor(
backend_context=context,
bands=bands,
fetch_type=FetchType.POLYGON,
**fetching_parameters,
)

cube = fetcher.get_cube(connection, spatial_context, temporal_context)

# Create a job asynchronously or run it synchronously.
```

### Example: extracting a dataset of points from Sentinel2

Extracting a dataset of points is similar; however, you will be required to manually add an aggregation operation after getting the cube from the fetcher.

Whenever extracting points, processing is done as if it were a set of small polygons, allowing the same OpenEO processes and workflow to be used as when processing tiles or sparse polygons. To get points, you then need to run an aggregation operation, for example: `cube.aggregate_spatial(spatial_context, reducer="mean")`.

```python

# Dataset of polygons for POINT based extraction
POINT_EXTRACTION_DF = (
Path(__file__).parent.parent / "tests/test_openeo_gfmap/resources/malawi_extraction_polygons.gpkg"
)

extraction_df = gpd.read_file(POINT_EXTRACTION_DF)

# Convert GeoDataFrame to feature collection to build spatial context
geojson_features = extraction_df.geometry.__geo_interface__
spatial_context = geojson.GeoJSON(
{"type": "FeatureCollection", "features": geojson_features["features"]}
)

# Build the temporal context
temporal_context = TemporalContext(
start_date=extraction_df.iloc[0]["start_date"],
end_date=extraction_df.iloc[0]["end_date"],
)

extractor = build_sentinel2_l2a_extractor(
backend_context=context,
bands=bands,
fetch_type=FetchType.POINT,
)

cube = extractor.get_cube(connection, spatial_context, temporal_context)

cube = cube.aggregate_spatial(spatial_context, reducer="mean")
```
15 changes: 15 additions & 0 deletions documentation/inference.qmd
@@ -0,0 +1,15 @@
---
title: "Model Inference"
---

Once features are computed, model predictions (classification, regression) are performed in the Model Inference class.

While similar in logic to how the [Feature Extractor](features.qmd) works, Model Inference doesn't reduce 4 dimensions into 3, but instead transforms the `features` dimension into `predictions`.

Essentially, a Model Inference class will load a model, run inference and write probabilities. It can also be used to perform post-processing on predictions, as done in [WorldCereal](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/postprocess.py).
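Conceptually, inference maps the `features` dimension to per-pixel prediction scores. The following is a toy NumPy sketch with a hypothetical linear model (illustration only, not the GFMAP API):

```python
import numpy as np

# Toy (features, y, x) cube: 128 features over a 4x4 patch
n_features, height, width = 128, 4, 4
features = np.random.rand(n_features, height, width)

# Hypothetical linear model producing 2 class scores from 128 features
weights = np.random.rand(2, n_features)

# einsum applies the model per pixel: (classes, features) x (features, y, x)
scores = np.einsum("cf,fyx->cyx", weights, features)

# Softmax over the class axis turns scores into pseudo-probabilities
probabilities = np.exp(scores) / np.exp(scores).sum(axis=0)
print(probabilities.shape)  # (2, 4, 4)
```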

### Running a model inference

An example of a `ModelInference` performing predictions can be found [here](https://github.com/WorldCereal/worldcereal-classification/blob/main/src/worldcereal/openeo/inference.py#L86). As when implementing a `FeatureExtractor`, at minimum two functions need to be implemented: `output_labels` and `execute`. The `execute` function is expected to return a 3D cube with the dimensions `(bands, x, y)`.

To run a model inference class, use the function `openeo_gfmap.inference.model_inference.apply_model_inference`.