Release version 1.5.0, Merge pull request #234 from sentinel-hub/develop
Release version 1.5.0
zigaLuksic authored Apr 25, 2023
2 parents ec43cd0 + 0cb02b8 commit 0e2fa52
Showing 163 changed files with 4,672 additions and 5,452 deletions.
14 changes: 0 additions & 14 deletions .coveragerc

This file was deleted.

4 changes: 2 additions & 2 deletions .github/workflows/ci_action.yml
@@ -51,7 +51,7 @@ jobs:

- name: Install packages
run: |
pip install -e .[DEV]
pip install -e .[DEV,ML]
- name: Run mypy
run: |
@@ -88,7 +88,7 @@ jobs:
sudo apt-get install -y build-essential gdal-bin libgdal-dev graphviz proj-bin gcc libproj-dev libspatialindex-dev
export CPLUS_INCLUDE_PATH=/usr/include/gdal
export C_INCLUDE_PATH=/usr/include/gdal
pip install -e .[DEV]
pip install -e .[DEV,ML]
pip install gdal==$(gdal-config --version | awk -F'[.]' '{print $1"."$2}')
- name: Run fast tests
23 changes: 15 additions & 8 deletions .pre-commit-config.yaml
@@ -12,20 +12,27 @@ repos:
- id: check-merge-conflict
- id: debug-statements

- repo: https://github.com/pre-commit/mirrors-prettier
rev: "v3.0.0-alpha.6"
hooks:
- id: prettier
exclude: "tests/(test_stats|test_project)/"
types_or: [json]

- repo: https://github.com/psf/black
rev: 22.12.0
rev: 23.3.0
hooks:
- id: black
language_version: python3

- repo: https://github.com/pycqa/isort
rev: 5.11.4
rev: 5.12.0
hooks:
- id: isort
name: isort (python)

- repo: https://github.com/PyCQA/autoflake
rev: v2.0.0
rev: v2.0.2
hooks:
- id: autoflake
args:
@@ -40,13 +47,13 @@ repos:
hooks:
- id: flake8
additional_dependencies:
- flake8-bugbear
- flake8-comprehensions
- flake8-simplify
- flake8-typing-imports
- flake8-bugbear==23.2.13
- flake8-comprehensions==3.10.1
- flake8-simplify==0.19.3
- flake8-typing-imports==1.14.0

- repo: https://github.com/nbQA-dev/nbQA
rev: 1.6.0
rev: 1.7.0
hooks:
- id: nbqa-black
- id: nbqa-isort
4 changes: 0 additions & 4 deletions MANIFEST.in

This file was deleted.

6 changes: 2 additions & 4 deletions Makefile
@@ -1,17 +1,15 @@
# Makefile for creating a new release of the package and uploading it to PyPI

PYTHON = python3

help:
@echo "Use 'make upload' to upload the package to PyPI"

upload:
rm -r dist | true
$(PYTHON) setup.py sdist bdist_wheel
python -m build --sdist --wheel
twine upload --skip-existing dist/*

# For testing:
test-upload:
rm -r dist | true
$(PYTHON) setup.py sdist bdist_wheel
python -m build --sdist --wheel
twine upload --repository testpypi --skip-existing dist/*
2 changes: 2 additions & 0 deletions README.md
@@ -65,6 +65,8 @@ Running pipelines is easiest by using the CLI provided by **`eo-grow`**. For all

## Documentation

For more information on the package visit [readthedocs](https://eo-grow.readthedocs.io/en/latest/).

Explanatory examples can be found [here](https://github.com/sentinel-hub/eo-grow/tree/main/examples).

More details on the config language used by **`eo-grow`** can be found [here](https://github.com/sentinel-hub/eo-grow/tree/main/docs/source/config-language.md).
202 changes: 202 additions & 0 deletions docs/source/common-configuration-patterns.md
@@ -0,0 +1,202 @@
# Common Configuration Patterns

## Using config templates

When you need to write a config for a pipeline, you can avoid rummaging through documentation by using the template command `eogrow-template`.

Invoking `eogrow-template "eogrow.pipelines.zipmap.ZipMapPipeline" "zipmap.json"` creates a file with the content:
```json
{
  "pipeline": "eogrow.pipelines.zipmap.ZipMapPipeline",
  "pipeline_name": "<< Optional[str] >>",
  "workers": "<< 1 : int >>",
  "use_ray": "<< 'auto' : Union[Literal['auto'], bool] >>",
  "input_features": {
    "<< type >>": "List[InputFeatureSchema]",
    "<< nested schema >>": "<class 'eogrow.pipelines.zipmap.InputFeatureSchema'>",
    "<< sub-template >>": {
      "feature": "<< Tuple[FeatureType, str] >>",
      "folder_key": "<< str >>",
      "include_bbox_and_timestamp": "<< True : bool >>"
    }
  },
  ...
}
```
You can now remove any parameters you do not need and fill out the rest.

Parameter values are of the form `"<< default : type >>"`, or `"<< default : type // description >>"` if you use the `--add-description` flag.

The parameters appear in order of definition, which is why the `ZipMapPipeline`-specific parameters come at the end (we switched the order a bit in the example above).

For nested schemas you get output like the one above for `"input_features"`, which tells you the type of the nesting and provides a sub-template for the nested pydantic model.

The functionality is not restricted to pipelines. While pipeline templates do not expand manager schemas directly, you can invoke `eogrow-template "eogrow.core.logging.LoggingManager" "logging_manager.json"` to get a template for the logging manager.
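Under the hood, the template mechanism walks the schema's fields and renders each as a placeholder string. A rough sketch of the idea, using a plain dataclass as a hypothetical stand-in for the real pydantic schemas (names and output format invented for illustration):

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


def build_template(model_cls) -> dict:
    """Render a dataclass as an eogrow-style template dict (rough sketch only)."""
    template = {}
    for f in fields(model_cls):
        # Use the short class name for plain types, the full string for generics
        type_name = f.type.__name__ if isinstance(f.type, type) else str(f.type)
        if f.default is not MISSING:
            template[f.name] = f"<< {f.default!r} : {type_name} >>"
        else:
            template[f.name] = f"<< {type_name} >>"
    return template


@dataclass
class ZipMapParams:  # hypothetical stand-in for the real pydantic schema
    pipeline_name: Optional[str] = None
    workers: int = 1


print(build_template(ZipMapParams)["workers"])  # << 1 : int >>
```

The real implementation works on pydantic models and handles nested schemas; this sketch only shows the "default and type rendered as a placeholder" idea.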

## Global config

Most of the configuration files have a lot in common. This tends to be especially true for fields describing managers:
- `area`
- `storage`
- `logging`

In our experience, it is often easiest to create a so-called *global configuration* which contains all such fields.

```
{ // global_config.json
  "area": {
    ...
  },
  "storage": {
    ...
  },
  "logging": {
    ...
  }
}
```

This is then used in pipeline configurations.

```
{ // export.json
  "pipeline": "eogrow.pipelines.export_maps.ExportMapsPipeline",
  "**global_config": "${config_path}/global_config.json",
  "feature": ["data", "BANDS"],
  "map_dtype": "int16",
  "cogify": true,
  ...
}
```

This keeps pipeline configs shorter and more readable. One can also use multiple such files, for instance one per manager. That makes it easy to have pipelines that work at different resolutions, where one can simply switch between `"**area_config": "${config_path}/area_10m.json"` and `"**area_config": "${config_path}/area_30m.json"`.

How fine-grained your config structure becomes is usually project-specific. Spreading it too thinly makes it harder to follow what precisely ends up in the final config.

### Adjusting settings from the global config

In some cases, the settings from a global config (or from a different config file that you are importing) need to be overridden. Imagine that a pipeline produces a ton of useless warnings, and you only wish to ignore them for that specific pipeline.

```
{ // export.json
  "pipeline": "eogrow.pipelines.export_maps.ExportMapsPipeline",
  "**global_config": "${config_path}/global_config.json",
  "logging": {
    "capture_warnings": false
  },
  "feature": ["data", "BANDS"],
  "map_dtype": "int16",
  "cogify": true,
  ...
}
```

The processed configuration will have all the logging settings from `global_config.json`, except for `"capture_warnings"`. See the [config language rules](config-language.html) for details on how configs are joined.
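The join behaves roughly like a recursive dictionary merge in which the pipeline config wins on conflicts. A minimal sketch of that idea (not the actual eo-grow implementation):

```python
def join_configs(base: dict, override: dict) -> dict:
    """Recursively join two configs, with `override` winning on conflicts.

    Sketch of the join idea only; the real eo-grow config language may
    differ in details.
    """
    joined = dict(base)
    for key, value in override.items():
        if isinstance(joined.get(key), dict) and isinstance(value, dict):
            # Both sides are dicts: merge recursively instead of replacing
            joined[key] = join_configs(joined[key], value)
        else:
            joined[key] = value
    return joined


global_config = {"logging": {"capture_warnings": True, "save_logs": True}}
pipeline_config = {"logging": {"capture_warnings": False}, "cogify": True}

merged = join_configs(global_config, pipeline_config)
# merged["logging"] keeps "save_logs" from the global config,
# but "capture_warnings" is overridden to False
```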

## Pipeline chains

Pipeline chains are briefly touched on in the config language docs, but only at the syntax level. Here we'll show two common usage patterns.

### End-to-end pipeline chain

In certain use cases we have multiple pipelines that are meant to be run in a specific succession. A great way of organizing them is order-prefixed naming, so that `03_export_pipeline.json` is run as the third pipeline.

But the user still needs to run them by hand and in the correct order. This can be automated with a simple pipeline chain that links them together:
```
[ // end_to_end_run.json
  {"**download": "${config_path}/01_download.json"},
  {"**preprocess": "${config_path}/02_preprocess_data.json"},
  {"**predict": "${config_path}/03_use_model.json"},
  {"**export": "${config_path}/04_export_maps.json"},
  {"**ingest": "${config_path}/05_ingest_byoc.json"}
]
```

A simple `eogrow end_to_end_run.json` now runs all of these pipelines one after another.
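Conceptually, running a chain amounts to loading the list and executing each entry in order. A simplified sketch of that behaviour (the real CLI additionally resolves the config language, e.g. `**` imports, variables, and `//` comments, before running anything):

```python
import json
from pathlib import Path


def run_chain(chain_path: str, run_pipeline) -> None:
    """Run each config of a pipeline chain in order (illustrative sketch)."""
    chain = json.loads(Path(chain_path).read_text())
    if not isinstance(chain, list):
        chain = [chain]  # treat a single pipeline config as a one-element chain
    for entry in chain:
        run_pipeline(entry)
```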

### Rerunning with different parameters

In experimentation we often want to run the same pipeline for multiple parameter values. With a tiny bit of boilerplate this can also be taken care of with config chains.

```
[ // run_threshold_experiments.json
  {
    "variables": {"threshold": 0.1},
    "**pipeline": "${config_path}/extract_trees.json"
  },
  {
    "variables": {"threshold": 0.2},
    "**pipeline": "${config_path}/extract_trees.json"
  },
  {
    "variables": {"threshold": 0.3},
    "**pipeline": "${config_path}/extract_trees.json"
  },
  {
    "variables": {"threshold": 0.4},
    "**pipeline": "${config_path}/extract_trees.json"
  }
]
```
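Such repetitive chains need not be written by hand. A sketch of generating one with a short script (file and variable names taken from the example above, not from eo-grow itself):

```python
import json

# Generate one chain entry per threshold value
thresholds = [0.1, 0.2, 0.3, 0.4]
chain = [
    {"variables": {"threshold": t}, "**pipeline": "${config_path}/extract_trees.json"}
    for t in thresholds
]

with open("run_threshold_experiments.json", "w") as f:
    json.dump(chain, f, indent=2)
```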

### Using variables with pipelines

While there is no syntactic sugar for specifying pipeline-chain-wide variables in JSON files, one can do that through the CLI. Running `eogrow end_to_end_run.json -v "year:2019"` sets the variable `year` to 2019 for all pipelines in the chain.
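The `-v` option takes `name:value` strings. A sketch of how such pairs might be parsed into a variables dict (hypothetical helper, not the actual CLI code):

```python
def parse_cli_variables(pairs):
    """Parse `-v "name:value"` strings into a variables dict (hypothetical helper)."""
    variables = {}
    for pair in pairs:
        name, sep, value = pair.partition(":")
        if not sep or not name:
            raise ValueError(f"Expected 'name:value', got {pair!r}")
        variables[name] = value
    return variables


print(parse_cli_variables(["year:2019", "quarter:Q1"]))  # {'year': '2019', 'quarter': 'Q1'}
```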

## Path modification via variables

In some cases one wants fine-grained control over path specifications. The following is a simplified example of how one can provide separate download paths for a large number of batch pipelines.

```
{ // global_config.json
  "storage": {
    "structure": {
      "batch_tiffs": "batch-download/tiffs/year-${var:year}-${var:quarter}",
      ...
    },
    ...
  },
  ...
}
```

```
{ // batch_download.json
  "pipeline": "eogrow.pipelines.download_batch.BatchDownloadPipeline",
  "**global_config": "${config_path}/global_config.json",
  "output_folder_key": "batch_tiffs",
  "inputs": [
    {
      "data_collection": "SENTINEL2_L2A",
      "time_period": "${var:year}-${var:quarter}"
    },
    ...
  ],
  ...
}
```

We now just need to provide the variables when running the config. This can be done either through the CLI via `eogrow batch_download.json -v "year:2019" -v "quarter:Q1"` or (for increased reproducibility) by creating configs with the variables specified in advance:

```
{ // batch_download_2019_Q4.json
  "**pipeline": "${config_path}/batch_download.json",
  "variables": {"year": 2019, "quarter": "Q4"}
}
```

In such cases, we advise against providing any variable values in the core pipeline configuration (i.e. `batch_download.json`), so that config parsing fails if not all variables are specified. Otherwise you risk typo-related problems, such as specifying a value for `"yaer"`, which will not override the `"year"` variable (and you end up downloading data for 2019 instead of 2020).
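The fail-fast behaviour can be pictured as placeholder substitution that raises on unresolved names instead of silently keeping a default. A sketch under that assumption (not the actual parser):

```python
import re

VAR_PATTERN = re.compile(r"\$\{var:([^}]+)\}")


def resolve_variables(text: str, variables: dict) -> str:
    """Substitute `${var:name}` placeholders, failing loudly on missing names."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value provided for variable {name!r}")
        return str(variables[name])

    return VAR_PATTERN.sub(replace, text)


path = resolve_variables(
    "batch-download/tiffs/year-${var:year}-${var:quarter}",
    {"year": 2019, "quarter": "Q4"},
)
print(path)  # batch-download/tiffs/year-2019-Q4
```

With this behaviour, a typo like `{"yaer": 2020}` raises an error immediately rather than producing output for the wrong year.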

A similar specific-paths mechanism can also be achieved by modifying the storage manager directly in the final config:
```
{ // batch_download_2019_Q4.json
  "**pipeline": "${config_path}/batch_download.json",
  "variables": {"year": 2019, "quarter": "Q4"},
  "storage": {
    "structure": {
      "batch_tiffs": "batch-download/tiffs/year-2019-Q4"
    }
  }
}
```
While that is sufficient for many cases and more explicit, variables are preferred and tend to be less error-prone for complex folder structures.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -59,7 +59,7 @@
"sphinx.ext.githubpages",
"nbsphinx",
"sphinx_rtd_theme",
"m2r2",
"sphinx_mdinclude",
"sphinxcontrib.autodoc_pydantic",
]

4 changes: 3 additions & 1 deletion docs/source/config-language.md
@@ -25,7 +25,7 @@ Additional notes:
- So far, the config language is not completely OS-agnostic and might not support Windows file paths.


### Pipeline joins
### Pipeline chains

A typical configuration is a dictionary with pipeline parameters. However, it can also be a list of dictionaries. In this case each dictionary must contain parameters of a single pipeline. The order of dictionaries defines the consecutive order in which pipelines will be run. Example:

@@ -44,3 +44,5 @@ A typical configuration is a dictionary with pipeline parameters. However, it ca
...
]
```

There is currently no functionality for merging multiple pipeline chains, except manually concatenating their contents into a single file.