This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

Commit

more best practices (relates to stac-extensions/classification#48 and s…
fmigneault-crim committed Mar 30, 2024
1 parent 4d765c2 commit 4db3b94
Showing 2 changed files with 41 additions and 14 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -70,7 +70,7 @@ extension to synthesize common use cases into a single reference for Machine Lea
| mlm:total_parameters | integer | Total number of model parameters, including trainable and non-trainable parameters. |
| mlm:pretrained_source | string | The source of the pretraining. Can refer to popular pretraining datasets by name (e.g., ImageNet) or less known datasets by URL and description. |
| mlm:summary | string | Text summary of the model and its purpose. |
| batch_size_suggestion | number | A suggested batch size for the accelerator and summarized hardware. |
| mlm:batch_size_suggestion | number | A suggested batch size for the accelerator and summarized hardware. |

In addition, fields from the following extensions must be imported in the item:
- [Scientific Extension Specification][stac-ext-sci] to describe relevant publications.
@@ -233,6 +233,10 @@ Note that the URI including the specific commit hash, release number or target b
other means of referring to checkout procedures, although this specification does not prohibit the use of additional
properties to better describe the Asset.

Since the source code of a model provides a useful example of how to use it, it is also recommended to define relevant
references to documentation using the `example` extension.
See the [Best Practices - Example Extension](best-practices.md#example-extension) section for more details.

Recommended asset `roles` include `code` and `metadata`,
since the source code asset might also refer to more detailed metadata than this specification captures.

49 changes: 36 additions & 13 deletions best-practices.md
@@ -13,6 +13,7 @@ models or creating tools to work with STAC.
- [Classification Extension](#classification-extension)
- [Scientific Extension](#scientific-extension)
- [File Extension](#file-extension)
- [Example Extension](#example-extension)
- [Version Extension](#version-extension)

## Using STAC Common Metadata Fields for the ML Model Extension
@@ -131,6 +132,9 @@ MLM definition to indicate which class values can be contained in the resulting

For more details, see the [Model Output Object](README.md#model-output-object) definition.

> [!NOTE]
> Update according to https://github.com/stac-extensions/classification/issues/48
### Scientific Extension

Provided that most models derive from previous scientific work, it is strongly recommended to employ the
@@ -142,19 +146,6 @@ lead to its creation.
This extension can also be used for the purpose of publishing new models, by providing users with the necessary details
regarding how they should cite its use (i.e., the `sci:citation` field and the `cite-as` relation type).
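
As a minimal sketch of the citation details mentioned above, the following Python snippet attaches the `sci:doi` and `sci:citation` properties and a `cite-as` link to a STAC Item held as a plain dictionary. The helper name `add_citation`, the Item contents, the DOI and the citation text are all hypothetical placeholders, not values taken from this specification.

```python
import copy
import json


def add_citation(item: dict, doi: str, citation: str) -> dict:
    """Return a copy of a STAC Item dict with Scientific Extension citation details.

    `add_citation` is a hypothetical helper written for illustration only.
    """
    updated = copy.deepcopy(item)  # leave the caller's Item untouched
    properties = updated.setdefault("properties", {})
    properties["sci:doi"] = doi
    properties["sci:citation"] = citation
    # The `cite-as` relation points users to the resource they should cite.
    updated.setdefault("links", []).append(
        {"rel": "cite-as", "href": f"https://doi.org/{doi}"}
    )
    return updated


# Placeholder Item, DOI and citation (assumptions for the example).
item = {"type": "Feature", "id": "example-model", "properties": {}, "links": []}
cited = add_citation(item, "10.0000/example", "Doe, J. (2024). Example model.")
print(json.dumps(cited, indent=2))
```

Working on a copy keeps the original Item unchanged, which can be convenient when the same base Item is reused to publish several models.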

### Version Extension

In the event that a model is retrained with gradually added annotations or improved training strategies leading to
better performance, the existing model and newer models represented by STAC Items with MLM should also make use of
the [Version Extension](https://github.com/stac-extensions/version). Using the fields and link relation types defined
by this extension, the retraining cycle of the model can be better described, with a full history of the newer versions
developed.

Additionally, the `version:experimental` field should be considered for models being trained and still under evaluation
before widespread deployment. This can be particularly useful for annotating model experiments during a cross-validation
training process to find the "best model". This field could also be used to indicate if a model is provided for
educational purposes only.

### File Extension

In order to provide a reliable and reproducible machine learning pipeline, external references to data required by the
@@ -187,3 +178,35 @@ that the model is properly instantiated from the expected weights, or that suffi
}
}
```

### Example Extension

In order to help users understand how to apply and run the described machine learning model,
the [Example Extension](https://github.com/stac-extensions/example-links#fields) can be used to provide code examples
demonstrating how it can be applied.

For example, a [Model Card on Hugging Face](https://huggingface.co/docs/hub/en/model-cards)
is often provided (see [Hugging Face Model examples](https://huggingface.co/models)) to describe the model, which
can embed sample code and references to more details about the model. This kind of reference should be added under
the `links` of the STAC Item using MLM.

Typically, a STAC Item using the MLM extension to describe the training or
inference strategies to apply a model should define the [Source Code Asset](README.md#source-code-asset).
This code is itself an ideal guide to running the model, and should therefore also be referenced as an `example` link
to offer more code samples for executing the model.
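
As a rough sketch of such a link, the Python snippet below builds an `example` link entry for the `links` of a STAC Item. The field names `example:language` and `example:container` reflect my understanding of the Example Extension and should be checked against its schema; the helper name `make_example_link`, the URL and the title are hypothetical.

```python
import json


def make_example_link(href: str, title: str, language: str, container: bool = False) -> dict:
    """Build a link entry with the `example` relation type.

    Hypothetical helper: the `example:*` field names are assumed from the
    Example Extension and are not defined by the MLM specification itself.
    """
    return {
        "rel": "example",
        "href": href,
        "title": title,
        "example:language": language,    # programming language of the sample
        "example:container": container,  # whether the link targets a container image
    }


# Placeholder Item and URL (assumptions for the example).
item = {"type": "Feature", "id": "example-model", "properties": {}, "links": []}
item["links"].append(
    make_example_link(
        href="https://example.com/model/inference_example.py",
        title="Inference example for the model",
        language="Python",
    )
)
print(json.dumps(item["links"], indent=2))
```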

> [!NOTE]
> Update according to https://github.com/stac-extensions/example-links/issues/4
### Version Extension

In the event that a model is retrained with gradually added annotations or improved training strategies leading to
better performance, the existing model and newer models represented by STAC Items with MLM should also make use of
the [Version Extension](https://github.com/stac-extensions/version). Using the fields and link relation types defined
by this extension, the retraining cycle of the model can be better described, with a full history of the newer versions
developed.

Additionally, the `version:experimental` field should be considered for models being trained and still under evaluation
before widespread deployment. This can be particularly useful for annotating model experiments during a cross-validation
training process to find the "best model". This field could also be used to indicate if a model is provided for
educational purposes only.
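
A retraining cycle could be recorded along the following lines. This is only a sketch: the helper `link_versions`, the Item identifiers and the URLs are hypothetical, the `predecessor-version` and `successor-version` relation types come from the Version Extension, and the `version:experimental` field name is used as written above and should be confirmed against the extension schema.

```python
import json


def link_versions(previous: dict, retrained: dict, previous_href: str, retrained_href: str) -> None:
    """Cross-link two STAC Item dicts that describe successive versions of a model.

    Hypothetical helper for illustration; it modifies both Items in place.
    """
    previous.setdefault("links", []).append(
        {"rel": "successor-version", "href": retrained_href, "type": "application/json"}
    )
    retrained.setdefault("links", []).append(
        {"rel": "predecessor-version", "href": previous_href, "type": "application/json"}
    )


# Placeholder Items: the retrained model is still under evaluation,
# so it is flagged as experimental (field name as used in the text above).
model_v1 = {"id": "model-v1", "properties": {"version": "1"}, "links": []}
model_v2 = {
    "id": "model-v2",
    "properties": {"version": "2", "version:experimental": True},
    "links": [],
}

link_versions(
    model_v1,
    model_v2,
    previous_href="https://example.com/items/model-v1.json",
    retrained_href="https://example.com/items/model-v2.json",
)
print(json.dumps([model_v1, model_v2], indent=2))
```

Linking in both directions lets a client walk the version history starting from either the older or the newer Item.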
