
FEAT allow metadata to be transformed in a Pipeline #28901

Merged

merged 38 commits from adrinjalali:pipeline/transform into scikit-learn:main on Nov 15, 2024

Conversation

adrinjalali
Member

Initial proposal: #28440 (comment)
xref: #28440 (comment)

This adds transform_input as a constructor argument to Pipeline, as:

    transform_input : list of str, default=None
        This enables transforming some input arguments to ``fit`` (other than
        ``X``) by the steps of the pipeline up to the step which requires
        them. The requirement is defined via :ref:`metadata routing
        <metadata_routing>`. This can be used, for instance, to pass a
        validation set through the pipeline.

        See the example TBD for more details.

        You can only set this if metadata routing is enabled, which you
        can enable using ``sklearn.set_config(enable_metadata_routing=True)``.

It simply allows metadata to be transformed by the already-fitted steps of the pipeline, up to the step which needs that metadata.

How does this look?

cc @lorentzenchr @ogrisel @amueller @betatim

github-actions bot commented Apr 26, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 568f37a.

@adrinjalali
Member Author

So for simple cases where metadata is only used in fit and transform expects no metadata, we're fine. But things get a bit trickier when a step's transform method accepts the same metadata as that step's fit.

Specifically, in this test:

# Imports assumed from scikit-learn's test utilities (module paths as in the
# scikit-learn repository at the time of this PR).
import numpy as np
import pytest

from sklearn.pipeline import make_pipeline
from sklearn.tests.metadata_routing_common import (
    ConsumingTransformer,
    _Registry,
    check_recorded_metadata,
)


@pytest.mark.usefixtures("enable_slep006")
@pytest.mark.parametrize("method", ["fit", "fit_transform"])
def test_transform_input_pipeline(method):
    """Test that with transform_input, data is correctly transformed for each step."""

    def get_transformer(registry, sample_weight, metadata):
        """Get a transformer with requests set."""
        return (
            ConsumingTransformer(registry=registry)
            .set_fit_request(sample_weight=sample_weight, metadata=metadata)
            .set_transform_request(sample_weight=sample_weight, metadata=metadata)
        )

    def get_pipeline():
        """Get a pipeline and corresponding registries.

        The pipeline has 4 steps, with different request values set to test different
        cases. One is aliased.
        """
        registry_1, registry_2, registry_3, registry_4 = (
            _Registry(),
            _Registry(),
            _Registry(),
            _Registry(),
        )
        pipe = make_pipeline(
            get_transformer(registry_1, sample_weight=True, metadata=True),
            get_transformer(registry_2, sample_weight=False, metadata=False),
            get_transformer(registry_3, sample_weight=True, metadata=True),
            get_transformer(registry_4, sample_weight="other_weights", metadata=True),
            transform_input=["sample_weight"],
        )
        return pipe, registry_1, registry_2, registry_3, registry_4

    def check_metadata(registry, methods, **metadata):
        """Check that the right metadata was recorded for the given methods."""
        assert registry
        for estimator in registry:
            for method in methods:
                check_recorded_metadata(
                    estimator,
                    method=method,
                    **metadata,
                )

    X = np.array([[1, 2], [3, 4]])
    y = np.array([0, 1])
    sample_weight = np.array([[1, 2]])
    other_weights = np.array([[30, 40]])
    metadata = np.array([[100, 200]])

    pipe, registry_1, registry_2, registry_3, registry_4 = get_pipeline()
    pipe.fit(
        X,
        y,
        sample_weight=sample_weight,
        other_weights=other_weights,
        metadata=metadata,
    )

    check_metadata(
        registry_1, ["fit", "transform"], sample_weight=sample_weight, metadata=metadata
    )
    check_metadata(registry_2, ["fit", "transform"])
    check_metadata(
        registry_3,
        ["fit", "transform"],
        sample_weight=sample_weight + 2,
        metadata=metadata,
    )
    check_metadata(
        registry_4,
        method.split("_"),  # ["fit", "transform"] if "fit_transform", ["fit"] otherwise
        sample_weight=other_weights + 3,
        metadata=metadata,
    )

During fit of the pipeline, step 3 receives transformed data in its transform method, because all metadata listed in transform_input are transformed. But the second time step3.transform is called, the metadata is not transformed (because I haven't implemented that in pipeline.transform yet).

The question is, what should be the expected behavior?

Do we want transform_input to only transform metadata when calling fit of sub-estimators? That's a bit tricky, because all TransformerMixin estimators implement a fit_transform which accepts all metadata together, which means a metadata entry with the same name is either transformed for both methods or for neither. (Wish we didn't have fit_transform in the first place; it's giving us so much headache.)

@adrinjalali
Member Author

Actually, in TransformerMixin we have:

        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )

and we never send anything to .transform. So in Pipeline we can also assume things are only transformed for fit, as far as scikit-learn is concerned.

However, for third-party transformers which have their own fit_transform and route parameters, things can become tricky, as the example in the previous comment shows.

@adrinjalali
Member Author

adrinjalali commented May 14, 2024

Another question is, do we want to have this syntactic sugar?

pipe = make_pipeline(
    StandardScaler(),
    HistGradientBoostingClassifier(..., early_stopping=True)
).fit(X, y, X_val=X_val, y_val=y_val)

The above code would:

  • have early_stopping=True change the default request values, so that the user doesn't have to type .set_fit_request(X_val=True, y_val=True)
  • have early_stopping=True set something on the estimator instance which tells the pipeline that X_val is of the same nature as X, and therefore should be transformed

It wouldn't change what we have now implemented in Pipeline in this PR, but would make it easier for the user. Not sure if it's too magical for us though.

For that to happen, HGBC would need to have:

class HistGradientBoostingClassifier(...):
    ...

    def get_metadata_routing(self):
        routing = super().get_metadata_routing()
        if self.early_stopping:
            routing.fit.add(X_val=True, y_val=True)
        return routing

    def __sklearn_get_transforming_data__(self):
        return ["X_val"]

And Pipeline would look for info in __sklearn_get_transforming_data__ if it exists.

cc @glemaitre

It goes in the direction of having more default routing info, which @ogrisel really likes (ref #26179).

Note that this could come later separately as an enhancement to this PR.

@adrinjalali
Member Author

There is an issue with testing metadata routing in more complex situations (which has come up in this PR): it requires recording the parent / caller alongside the metadata in the testing estimators. Those fixes are now included in this PR.

Member

@lorentzenchr lorentzenchr left a comment

A partial review.
@adrinjalali Just that you see that at least someone cares.

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

I'm very keen to see that example, maybe the HGBT early stopping case?

Member Author

@adrinjalali adrinjalali Sep 2, 2024

I went and checked if I could use lightgbm, but there the validation set is passed as a list of tuples. There's no way to process that in a pipeline.

As for HGBT, it would look like this:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2,] = X[:2,] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50,], y[:50,]
X, y = X[50:,], y[50:,]

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
        ).set_fit_request(X_val=True, y_val=True),
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val and y_val to the Pipeline, which knows how to deal with
# them.

Member

@adam2392 adam2392 Oct 18, 2024

To confirm my understanding:

est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
        ).set_fit_request(X_val=True, y_val=True),
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)

This would only transform X and y, whereas your example would explicitly mark X_val and y_val (in addition to X and y) to be transformed as well?

Member Author

Correct!

Member

Only side remark: Our HistGradientBoostingRegressor does not YET support X_val and y_val in fit.

will be transformed.

`all_params` are the metadata passed by the user. Used to call `transform`
on the pipeline itself.
Member

Adding the Parameters section in the docstring might help to better understand this method.

Member Author

I think this now helps.

@adrinjalali
Member Author

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

@lorentzenchr
Member

I checked lightgbm and xgboost; they both take a list of tuples as validation sets, which makes me wonder why, and whether this proposal is enough!

Let's ask them directly: @jameslamb @StrikerRUS @shiyu1994 @trivialfis @hcho3 your opinion would be very much appreciated. We are trying to transform metadata on the way to the step of a pipeline where it is needed, e.g. validation data for early stopping in GBTs, see #28901 (comment) (StandardScaler is just for demonstration purposes).

Member Author

@adrinjalali adrinjalali left a comment

As for the example, @lorentzenchr I'm not sure what you want me to add, since we don't have any estimator in scikit-learn which can use this right now. Do you want a fake, mini estimator using X_val in the example? This example makes more sense once we merge the validation set work in HGBT, doesn't it?

Comment on lines 1917 to 1951
import numpy as np
from numpy.testing import assert_array_equal

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline


def test_transform_tuple_input():
    """Test that if metadata is a tuple of arrays, both arrays are transformed."""

    class Estimator(ClassifierMixin, BaseEstimator):
        def fit(self, X, y, X_val=None, y_val=None):
            assert isinstance(X_val, tuple)
            assert isinstance(y_val, tuple)
            # Here we make sure that each X_val is transformed by the transformer
            assert_array_equal(X_val[0], np.array([[2, 3]]))
            assert_array_equal(y_val[0], np.array([0, 1]))
            assert_array_equal(X_val[1], np.array([[11, 12]]))
            assert_array_equal(y_val[1], np.array([1, 2]))
            return self

    class Transformer(TransformerMixin, BaseEstimator):
        def fit(self, X, y):
            return self

        def transform(self, X):
            return X + 1

    X = np.array([[1, 2]])
    y = np.array([0, 1])
    X_val0 = np.array([[1, 2]])
    y_val0 = np.array([0, 1])
    X_val1 = np.array([[10, 11]])
    y_val1 = np.array([1, 2])
    pipe = Pipeline(
        [
            ("transformer", Transformer()),
            ("estimator", Estimator().set_fit_request(X_val=True, y_val=True)),
        ],
        transform_input=["X_val"],
    )
    pipe.fit(X, y, X_val=(X_val0, X_val1), y_val=(y_val0, y_val1))
Member Author

the branch is updated @jameslamb

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
For instance, this can be used to pass a validation set through the pipeline.

See the example TBD for more details.
Member

Suggested change
- See the example TBD for more details.

Let's do that later in a different PR.

them. Requirement is defined via :ref:`metadata routing <metadata_routing>`.
This can be used to pass a validation set through the pipeline for instance.

See the example TBD for more details.
Member

Suggested change
- See the example TBD for more details.

Contributor

@trivialfis trivialfis left a comment

Looks good to me, one typo in the comment. Will try to test it with XGBoost.

@glemaitre glemaitre self-requested a review November 7, 2024 17:00
@@ -0,0 +1,3 @@
- :class:`pipeline.Pipeline` can now transform metadata up to the step requiring the
metadata, which can be set using the `transform_input` parameter.
By `Adrin Jalali`_.
Member

Suggested change
- By `Adrin Jalali`_.
+ By `Adrin Jalali`_

Member

@glemaitre glemaitre left a comment

LGTM. Only nitpicks and small typos.

@glemaitre glemaitre merged commit 56a4adb into scikit-learn:main Nov 15, 2024
30 checks passed
@adrinjalali adrinjalali deleted the pipeline/transform branch November 15, 2024 09:18
@glemaitre
Member

ping @jeremiedbb for backport in the branch
