This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

more examples and details
fmigneault committed Apr 2, 2024
1 parent 1a50057 commit 1faf4d9
Showing 2 changed files with 64 additions and 15 deletions.
46 changes: 36 additions & 10 deletions README.md
@@ -5,7 +5,7 @@
- **Title:** Machine Learning Model Extension
- **Identifier:** [https://schemas.stacspec.org/2.0.0.alpha.0/extensions/ml-model/json-schema/schema.json](https://schemas.stacspec.org/2.0.0.alpha.0/extensions/ml-model/json-schema/schema.json)
- **Field Name Prefix:** mlm
- **Scope:** Item, Collection
- **Scope:** Collection, Item, Asset, Links
- **Extension Maturity Classification:** Proposal
- **Owner:**
- [@fmigneault](https://github.com/fmigneault)
@@ -19,7 +19,8 @@ trained on overhead imagery and enable running model inference.
The main objectives of the extension are:

1) to enable building model collections that can be searched alongside associated STAC datasets
2) record all necessary bands, parameters, modeling artifact locations, and high-level processing steps to deploy an inference service.
2) record all necessary bands, parameters, modeling artifact locations, and high-level processing steps to deploy
an inference service.

Specifically, this extension records the following information to make ML models searchable and reusable:
1. Sensor band specifications
@@ -31,7 +32,8 @@ Specifically, this extension records the following information to make ML models
The MLM specification is biased towards providing metadata fields for supervised machine learning models.
However, fields that relate to supervised ML are optional and users can use the fields they need for different tasks.

See [Best Practices](./best-practices.md) for guidance on what other STAC extensions you should use in conjunction with this extension.
See [Best Practices](./best-practices.md) for guidance on what other STAC extensions you should use in conjunction
with this extension.
The Machine Learning Model Extension purposely omits and delegates some definitions to other STAC extensions to favor
reusability and avoid metadata duplication whenever possible. A properly defined MLM STAC Item/Collection should almost
never have the Machine Learning Model Extension exclusively in `stac_extensions`.
@@ -53,14 +55,22 @@ extension to synthesize common use cases into a single reference for Machine Lea

## Item Properties and Collection Fields

The fields in the table below can be used in these parts of STAC documents:

- [ ] Catalogs
- [x] Collections
- [x] Item Properties (incl. Summaries in Collections)
- [x] Assets (for both Collections and Items, incl. Item Asset Definitions in Collections, except `mlm:name`)
- [ ] Links

| Field Name | Type | Description |
|-----------------------------|--------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| mlm:name                    | string                                           | **REQUIRED** A unique name for the model. This name can include the model architecture, but must be distinct from the architecture name alone. If there is a publication or other published work related to the model, use the official name of the model. |
| mlm:architecture | [Model Architecture](#model-architecture) string | **REQUIRED** A generic and well established architecture name of the model. |
| mlm:tasks                   | [[Task Enum](#task-enum)]                        | **REQUIRED** Specifies the Machine Learning tasks for which the model can be used. If multi-task outputs are provided by distinct model heads, specify all available tasks under the main properties and specify the respective tasks in each [Model Output Object](#model-output-object). |
| mlm:framework | string | **REQUIRED** Framework used to train the model (ex: PyTorch, TensorFlow). |
| mlm:framework_version | string | The `framework` library version. Some models require a specific version of the machine learning `framework` to run. |
| mlm:memory_size | integer | **REQUIRED** The in-memory size of the model on the accelerator during inference (bytes). |
| mlm:memory_size | integer | The in-memory size of the model on the accelerator during inference (bytes). |
| mlm:accelerator | [Accelerator Enum](#accelerator-enum) \| null | The intended computational hardware that runs inference. If undefined or set to `null` explicitly, the model does not require any specific accelerator. |
| mlm:accelerator_constrained | boolean | Indicates if the intended `accelerator` is the only `accelerator` that can run inference. If undefined, it should be assumed `false`. |
| mlm:accelerator_summary | string | A high level description of the `accelerator`, such as its specific generation, or other relevant inference details. |
@@ -71,9 +81,22 @@ extension to synthesize common use cases into a single reference for Machine Lea
| mlm:input | [[Model Input Object](#model-input-object)] | **REQUIRED** Describes the transformation between the EO data and the model input. |
| mlm:output | [[Model Output Object](#model-output-object)] | **REQUIRED** Describes each model output and how to interpret it. |
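
As a rough illustration of the table above, the following sketch builds a minimal set of Item `properties` and checks
that the fields marked **REQUIRED** are present. Only the required fields visible in this excerpt are listed, so the
set may be incomplete. The helper name `missing_mlm_fields`, the model name, and the `classification` task value are
assumptions made for this example, and the empty `mlm:input` and `mlm:output` lists are placeholders rather than
valid definitions.

```python
# Required fields as marked in the table above. Rows not shown in this excerpt
# may add more, so treat this set as an assumption.
REQUIRED_MLM_FIELDS = {
    "mlm:name",
    "mlm:architecture",
    "mlm:tasks",
    "mlm:framework",
    "mlm:input",
    "mlm:output",
}


def missing_mlm_fields(properties: dict) -> set:
    """Return the required MLM fields that are absent from the given properties."""
    return REQUIRED_MLM_FIELDS - set(properties)


example_properties = {
    "mlm:name": "example-resnet-landcover",  # hypothetical model name
    "mlm:architecture": "ResNet",
    "mlm:tasks": ["classification"],  # assumed to be a valid Task Enum value
    "mlm:framework": "PyTorch",
    "mlm:framework_version": "2.1.0",  # optional, hypothetical version
    "mlm:memory_size": 512 * 1024 * 1024,  # optional, in bytes
    "mlm:input": [],  # placeholder, see Model Input Object
    "mlm:output": [],  # placeholder, see Model Output Object
}

print(missing_mlm_fields(example_properties))
```

A check like this only verifies presence. It does not replace validation against the extension's JSON schema.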

In addition, fields from the following extensions must be imported in the item:
- [Scientific Extension Specification][stac-ext-sci] to describe relevant publications.
- [Version Extension Specification][stac-ext-ver] to define version tags.
To decide whether the above fields should be applied under Item `properties` or under their respective Assets, the
context of each field must be considered. For example, `mlm:name` should always be provided in the Item `properties`,
since it relates to the model as a whole. In contrast, some models could support multiple `mlm:accelerator` values,
which could be handled by distinct source code represented by different Assets. In such a case, the `mlm:accelerator`
definitions should be nested under their relevant Assets. If a field is defined at both the Item and Asset levels,
the Asset-level value applies to that specific Asset, and the Item-level value applies to any other Asset that does
not override it. For some fields, the following sections provide further details to clarify potentially ambiguous
use cases.
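
The Item/Asset precedence described above could be sketched as follows. This is a minimal illustration, not part of
the specification: the helper name `resolve_mlm_field`, the asset keys, the `href` values, and the accelerator values
`cuda` and `amd64` are all assumptions made for the example.

```python
def resolve_mlm_field(item: dict, asset_key: str, field: str):
    """Return the Asset-level value of a field if defined, else the Item-level value."""
    asset = item.get("assets", {}).get(asset_key, {})
    if field in asset:
        # An explicit Asset-level value (including None) overrides the Item level.
        return asset[field]
    return item.get("properties", {}).get(field)


item = {
    "properties": {
        "mlm:name": "example-model",  # hypothetical, always defined at the Item level
        "mlm:accelerator": "cuda",  # assumed Accelerator Enum value
    },
    "assets": {
        "model-cpu": {
            "href": "https://example.com/model-cpu.pt",  # hypothetical location
            "mlm:accelerator": "amd64",  # assumed Accelerator Enum value
        },
        "model-default": {
            "href": "https://example.com/model.pt",  # hypothetical location
        },
    },
}

print(resolve_mlm_field(item, "model-cpu", "mlm:accelerator"))
print(resolve_mlm_field(item, "model-default", "mlm:accelerator"))
```

In this sketch, the `model-cpu` Asset uses its own accelerator, while `model-default` falls back to the Item-level
value.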

In addition, fields from multiple relevant extensions should be defined as applicable. See
[Best Practices - Recommended Extensions to Compose with the ML Model Extension](best-practices.md#recommended-extensions-to-compose-with-the-ml-model-extension)
for more details.

For the [Extent Object](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#extent-object)
in STAC Collections and the corresponding spatial and temporal fields in Items, please refer to section
[Best Practices - Using STAC Common Metadata Fields for the ML Model Extension](best-practices.md#using-stac-common-metadata-fields-for-the-ml-model-extension).

[stac-ext-sci]: https://github.com/radiantearth/stac-spec/tree/v1.0.0-beta.2/extensions/scientific/README.md
[stac-ext-ver]: https://github.com/radiantearth/stac-spec/tree/v1.0.0-beta.2/extensions/version/README.md
@@ -411,9 +434,12 @@ The following types should be used as applicable `rel` types in the
[Link Object](https://github.com/radiantearth/stac-spec/tree/master/item-spec/item-spec.md#link-object)
of STAC Items describing Band Assets that result from the inference of a model described by the MLM extension.

| Type | Description |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| derived_from | This link points to a STAC Collection or Item using MLM, using the corresponding [`mlm:name`](#item-properties-and-collection-fields) value. |
| Type | Description |
|--------------|----------------------------------------------------------|
| derived_from | This link points to a STAC Collection or Item using MLM. |

It is recommended that a `derived_from` link referring to another STAC definition that uses the MLM extension
specifies the [`mlm:name`](#item-properties-and-collection-fields) value to make the derived reference more explicit.
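
One possible way to follow this recommendation is sketched below. Placing `mlm:name` directly as an additional field
of the Link Object is an assumption of this example, as are the helper name `make_derived_from_link`, the `href`, and
the model name.

```python
def make_derived_from_link(model_href: str, model_name: str) -> dict:
    """Build a Link Object pointing to the STAC Item that describes the model."""
    return {
        "rel": "derived_from",
        "href": model_href,
        "type": "application/geo+json",  # media type of a STAC Item
        "mlm:name": model_name,  # assumed placement of the model name in the link
    }


link = make_derived_from_link(
    "https://example.com/stac/collections/models/items/example-model",  # hypothetical
    "example-model",  # hypothetical, should match the `mlm:name` of the referenced Item
)
print(link)
```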

Note that a derived product from model inference described by STAC should also consider using
additional indications that it came from a model, such as described by
33 changes: 28 additions & 5 deletions best-practices.md
@@ -18,12 +18,35 @@ models or creating tools to work with STAC.

## Using STAC Common Metadata Fields for the ML Model Extension

It is recommended to use the `start_datetime` and `end_datetime`, `geometry`, and `bbox` to represent the
recommended context of data the model was trained with and for which the model should have appropriate domain
knowledge for inference. For example, we can consider a model which is trained on imagery from all over the world
It is recommended to use the `start_datetime` and `end_datetime`, `geometry`, and `bbox` in a STAC Item,
and the corresponding
[Extent Object](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#extent-object)
in a Collection, to represent the *recommended context* of the data the model was trained with and for which the model
should have appropriate domain knowledge for inference.

For example, if a model was trained using the [EuroSAT][EuroSAT-github] dataset and is described using MLM, it would
be reasonable to describe it with a time range of 2015-2018 and an area corresponding to the European Urban Atlas, as
described by the [EuroSAT paper][EuroSAT-paper]. However, it could also be considered adequate to define a wider extent,
since reasonably similar classes and domain distribution can be expected in the following years and in other
locations. Because the exact extent applicable to a model is difficult to define reliably, it is left to the good
judgement of users to provide adequate values. Note that users employing the model can also choose to apply it to
contexts outside the *recommended* extent for the same reason.

[EuroSAT-github]: https://github.com/phelber/EuroSAT
[EuroSAT-paper]: https://www.researchgate.net/publication/319463676

As another example, let us consider a model which is trained on imagery from all over the world
and is robust enough to be applied to any time period. In this case, the common metadata to use with the model
would include the bbox of "the world" `[-180, -90, 180, 90]` and the `start_datetime` and `end_datetime` range could
be generic values like `["1900-01-01", null]`.
could include the bbox of "the world" `[-180, -90, 180, 90]` and the `start_datetime` and `end_datetime` range could
be generic values like `["1900-01-01", null]`. However, note that generic and very broad spatiotemporal extents like
these rarely reflect the actual capabilities and precision of the model to predict reliable results. If a more
restricted area and time of interest can be identified, such as the ranges covered by the training dataset, or by a
test split dataset that validates the applicability of the model to other domains, those should be provided instead.
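
For reference, a very broad extent of this kind could be written in a Collection as sketched below. STAC orders 2D
`bbox` values as west, south, east, north, and a Collection's temporal interval accepts `null` for an open-ended
bound. The helper name `is_valid_bbox` and the timestamp are assumptions made for this example.

```python
def is_valid_bbox(bbox) -> bool:
    """Check that a 2D bbox follows the [west, south, east, north] order and ranges."""
    west, south, east, north = bbox
    # West may exceed east when the box crosses the antimeridian, so only ranges are checked.
    longitudes_ok = -180 <= west <= 180 and -180 <= east <= 180
    latitudes_ok = -90 <= south <= north <= 90
    return longitudes_ok and latitudes_ok


world_extent = {
    "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
    # Open-ended interval: None is serialized as JSON null.
    "temporal": {"interval": [["1900-01-01T00:00:00Z", None]]},
}

print(is_valid_bbox(world_extent["spatial"]["bbox"][0]))
```

A check like this helps catch latitude and longitude values that were swapped by mistake.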

If specific datasets with training/validation/test splits are known to support the claims of the suggested extent for
the model, it is recommended to include them as references in the STAC Item/Collection using MLM. For more
information regarding these references, see the [ML-AOI and Label Extensions](#ml-aoi-and-label-extensions) details.

## Recommended Extensions to Compose with the ML Model Extension

