Add default training pipeline #184

kvantricht · 2024-10-14T16:32:04Z

This PR adds the training pipelines for the default cropland model trained on global Presto embeddings for the WorldCereal reference database.

jdegerickx · 2024-10-15T08:49:35Z

scripts/spark/train_catboost.py

+            for class_nr in range(len(self.settings["classes"]))
+        ]
+
+        model = CatBoostClassifier(


Should these model settings (the arguments passed to `CatBoostClassifier`) be configurable?
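A minimal sketch of what configurable model settings could look like, assuming the training script keeps a dictionary of defaults and lets the caller override individual entries. The helper name `resolve_model_settings` and all default values below are hypothetical; the values actually used in `train_catboost.py` are not shown in this thread. The keys are existing `CatBoostClassifier` parameter names.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults, for illustration only; not the values used in the PR.
DEFAULT_MODEL_SETTINGS: Dict[str, Any] = {
    "iterations": 1000,
    "depth": 6,
    "learning_rate": 0.05,
    "early_stopping_rounds": 50,
    "random_seed": 42,
}


def resolve_model_settings(
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge caller overrides into the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_MODEL_SETTINGS)
    if unknown:
        raise ValueError(f"Unknown model settings: {sorted(unknown)}")
    # Later entries win, so overrides replace the matching defaults.
    return {**DEFAULT_MODEL_SETTINGS, **overrides}


params = resolve_model_settings({"depth": 8})
# The resolved dictionary could then be unpacked into the classifier, e.g.
# model = CatBoostClassifier(**params)
```

Rejecting unknown keys keeps a typo in a settings file from silently falling back to a default.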

jdegerickx · 2024-10-15T08:56:58Z

src/worldcereal/train/__init__.py

all this filtering of reference datasets and filtering of samples based on rules does not feel generic.
It's also very much tuned to Phase I extractions.
We should probably get rid of this at some point?
Maybe even have a separate repository where we train the global models?

At some point, yes maybe, but this is how the model is currently trained, so we need to be transparent. I agree that these methods need to evolve in the coming months.
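As a rough illustration of how the dataset and sample filtering discussed in this thread could evolve towards something more generic, the rules could be expressed as data and applied by one small function. This is only a sketch under stated assumptions: `FilterRule`, `apply_rules`, the field names `ref_id` and `label`, and the example samples are all hypothetical and do not reflect the code in `src/worldcereal/train/__init__.py`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]


@dataclass(frozen=True)
class FilterRule:
    """A named predicate; samples for which keep() is False are dropped."""

    name: str
    keep: Callable[[Sample], bool]


def apply_rules(
    samples: List[Sample], rules: List[FilterRule]
) -> Tuple[List[Sample], Dict[str, int]]:
    """Apply the rules in order and report how many samples each one dropped."""
    kept = list(samples)
    dropped: Dict[str, int] = {}
    for rule in rules:
        before = len(kept)
        kept = [s for s in kept if rule.keep(s)]
        dropped[rule.name] = before - len(kept)
    return kept, dropped


# Made-up samples and rules, for illustration only.
EXCLUDED_REF_IDS = {"dataset_c"}
samples = [
    {"ref_id": "dataset_a", "label": "cropland"},
    {"ref_id": "dataset_b", "label": None},
    {"ref_id": "dataset_c", "label": "other"},
]
rules = [
    FilterRule("excluded_ref_id", lambda s: s["ref_id"] not in EXCLUDED_REF_IDS),
    FilterRule("missing_label", lambda s: s["label"] is not None),
]

kept, dropped = apply_rules(samples, rules)
```

Keeping a per-rule count of dropped samples would also serve the transparency goal mentioned above, since it documents the effect of each cleaning step.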

jdegerickx

main comment is whether we really want to commit all this Phase I related cleaning of datasets to this repository?
I guess in the end all default models will be trained based on Phase II RDM samples and extractions.
So perhaps training of global models should (for now) be done in a separate repository, where we import functionality from worldcereal-classification?
Just a suggestion...

kvantricht · 2024-10-15T09:00:35Z

> main comment is whether we really want to commit all this Phase I related cleaning of datasets to this repository? I guess in the end all default models will be trained based on Phase II RDM samples and extractions. So perhaps training of global models should (for now) be done in a separate repository, where we import functionality from worldcereal-classification? Just a suggestion...

I don't feel like setting up yet another repository, especially not right now. My thought was to add what there is now for transparency, but I'm also fine with not merging it for the time being.

kvantricht and others added 2 commits October 14, 2024 18:31

Default training pipeline added

144b9c6

Updated comment

dc6052f

kvantricht requested a review from jdegerickx October 15, 2024 07:15

jdegerickx reviewed Oct 15, 2024

jdegerickx requested changes Oct 15, 2024

kvantricht marked this pull request as draft October 15, 2024 12:15

kvantricht added 9 commits October 15, 2024 17:26

Merge branch 'main' into add-default-classificationpipelines

d7d7500

Allow to load model from file

dbf7a54

Add masking to compute presto embeddings

4c7cf84

Dont do weighting based on ref_id

bc27b1e

Add masking to compute embeddings

a17c43d

Revert arg name to avoid issues

f38a9a0

Updated model training

06ed51e

Merge branch 'main' into add-default-classificationpipelines

b63cf4d

Update docstring

5add2a2
