Use new transformer.get_feature_names_out function #248

falcaopetri · 2021-10-17T18:07:36Z

Transformer's get_feature_names is being deprecated in favor of get_feature_names_out. It will be removed in sklearn 1.2 (see sklearn v1.0 changelog, scikit-learn/scikit-learn#18444, and, for example, OneHotEncoder.get_feature_names).

This PR:

- Prefers estimator.get_feature_names_out() over estimator.get_feature_names()
- Configures nox to run tests with both scikit-learn 0.23 and 1.0

The change currently breaks the README's Dynamic Columns example, because StandardScaler has no get_feature_names method in either sklearn 0.23 or 1.0:

- The current example's transformed_names_ is ['x_0', 'x_1', 'x_2', 'x_3', 'petal_0', 'petal_1'].
- In sklearn 1.0, though, there is a StandardScaler.get_feature_names_out, which this PR uses and which therefore produces ['x_x0', 'x_x1', 'x_x2', 'x_x3', 'petal_0', 'petal_1'].

ragrawal · 2021-10-18T04:27:15Z

FYI: it seems sklearn >= 1.0 requires Python>=3.7.

sklearn_pandas/dataframe_mapper.py

falcaopetri · 2021-10-19T02:28:20Z

One thing I noticed is that the current implementation is already slightly inconsistent under sklearn 0.23:

import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
mapper = DataFrameMapper([
    (['col1', 'col2'], sklearn.preprocessing.StandardScaler()),
    (['col1', 'col2'], sklearn.preprocessing.OneHotEncoder()),
], df_out=True)
print(mapper.fit_transform(df).columns)

With sklearn 0.23 or 1.0, the output is:

Index(['col1_col2_0', 'col1_col2_1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

Note that StandardScaler columns are named {name}_{i}, while OneHotEncoder columns are named {name}_{feature}, where each feature comes from estimator.get_feature_names().

Meanwhile, sklearn 1.0 with this PR outputs:

Index(['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

(None of these column names is very helpful, though, as discussed in #174.)
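
The two naming schemes noted above can be reproduced without sklearn. The following is only an illustration of the observed behaviour, not DataFrameMapper's actual code: compose_column_names is a hypothetical function, and the length check is an assumption based on the description of the name inference given later in this thread.

```python
def compose_column_names(prefix, n_outputs, feature_names=None):
    """Illustrative only: mimic the two naming schemes seen in the outputs."""
    # Use the transformer-reported names when there is one per output column.
    if feature_names is not None and len(feature_names) == n_outputs:
        return [f"{prefix}_{name}" for name in feature_names]
    # Otherwise fall back to positional suffixes.
    return [f"{prefix}_{i}" for i in range(n_outputs)]


prefix = "_".join(["col1", "col2"])

# StandardScaler without this PR: no feature names are available.
print(compose_column_names(prefix, 2))
# ['col1_col2_0', 'col1_col2_1']

# StandardScaler with this PR under sklearn 1.0: names come from get_feature_names_out.
print(compose_column_names(prefix, 2, ["x0", "x1"]))
# ['col1_col2_x0', 'col1_col2_x1']
```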

StochasticBoris · 2022-08-05T07:23:29Z

The latest versions of scikit-learn (1.1+) have significantly improved the coverage of transformers that implement get_feature_names_out (#21308 - Implement get_feature_names_out for all estimators). Is there any possibility of revisiting this issue? The current naming behaviour of DataFrameMapper is still incorrect, as demonstrated by @falcaopetri above.

Having correct output names in a pipeline of sequential mappers is crucial, and incorrect names get out of hand quickly when the dataset has many columns. The problem is exacerbated by transformers that operate non-independently on multiple columns (such as PolynomialFeatures, which generates interaction features), since they rule out gen_features, which otherwise names columns better (i.e. without listing all columns for each feature). See the example below:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
mapper = DataFrameMapper([
    (['col1', 'col2'], poly),
], df_out=True)

print(mapper.fit_transform(df).columns)
>>> FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
>>> ['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0^2', 'col1_col2_x0 x1', 'col1_col2_x1^2']

print(poly.get_feature_names_out(['col1', 'col2']))
>>> ['col1' 'col2' 'col1^2' 'col1 col2' 'col2^2']

ragrawal · 2022-08-07T06:00:20Z

Hi, let me review the PR this week and merge it. I think it will need to be a major release, since it will break some existing functionality.

falcaopetri · 2022-08-07T14:55:27Z

Hi, everyone. Please let me know if there's anything I can do from my end.

hu-minghao · 2022-08-07T14:55:49Z

Hello, I have received your message. Thank you.

ragrawal · 2022-08-08T03:54:49Z

Hi @falcaopetri, thanks for your contribution. I made a few changes to your PR, but I think we need to rethink the whole alias/prefix/suffix handling. If you are available, we can have a quick chat to discuss how to handle it.

An update: there seems to be a difference in get_feature_names_out between 1.1.0 and 1.1.2 for sklearn.decomposition.PCA.

hu-minghao · 2022-10-11T07:14:39Z

Hello, I have received your message. Thank you.

bryant1410 · 2023-04-26T22:15:06Z

Here is a workaround that works for me:

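import itertools
from collections.abc import Iterable

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn_pandas import DataFrameMapper


def _fix_column_names(df: pd.DataFrame, mapper: DataFrameMapper) -> pd.DataFrame: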
    for columns, transformer, kwargs in mapper.built_features:
        if (isinstance(transformer, OneHotEncoder)
                or (isinstance(transformer, Pipeline) and any(isinstance(t, OneHotEncoder) for t in transformer))):
            assert isinstance(columns, Iterable) and not isinstance(columns, str)

            new_names = transformer.get_feature_names_out(columns)

            old_name_prefix = kwargs.get("alias", "_".join(str(c) for c in columns))
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))
        elif isinstance(transformer, Pipeline) and isinstance(transformer[0], MultiLabelBinarizer):
            # sklearn-pandas infers the names by iterating over the transformers and taking the feature names from
            # the last one that provides them. It then checks that their number matches the number of output
            # features. If the binarizer is followed by feature selection, that check fails, so we handle this
            # case manually here.
            assert isinstance(columns, str)

            # `MultiLabelBinarizer` doesn't implement `get_feature_names_out`.
            new_names = [f"{columns}_{c}" for c in transformer[0].classes_]

            # We slice as an iterator and not by passing a slice to `__getitem__` because if the transformer is of type
            # `TransformerPipeline` then it fails.
            for t in itertools.islice(transformer, 1, None):
                new_names = t.get_feature_names_out(new_names)

            old_name_prefix = kwargs.get("alias", columns)
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))

    return df
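
The renaming step at the core of this workaround can be looked at in isolation. The following is a minimal sketch under the same assumption the workaround makes, namely that the mapper emitted positional names of the form {prefix}_{i}; build_rename_map is a hypothetical helper, not part of sklearn-pandas.

```python
def build_rename_map(columns, new_names, alias=None):
    """Hypothetical helper: map positional names '{prefix}_{i}' to real feature names."""
    # The prefix is the alias if one was given, else the input columns joined by "_".
    prefix = alias if alias is not None else "_".join(str(c) for c in columns)
    old_names = [f"{prefix}_{i}" for i in range(len(new_names))]
    return dict(zip(old_names, new_names))


print(build_rename_map(["col1", "col2"], ["col1_0", "col1_1", "col2_0"]))
# {'col1_col2_0': 'col1_0', 'col1_col2_1': 'col1_1', 'col1_col2_2': 'col2_0'}
```

The resulting dictionary is what would be passed to DataFrame.rename(columns=...). Old names that are not present in the frame are ignored by rename by default, so the map is harmless when the mapper already produced different names.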

hu-minghao · 2023-04-26T22:15:25Z

Hello, I have received your message. Thank you.

Use new transformer.get_feature_names_out function

b894942

- Prefer `estimator.get_feature_names_out()` over `estimator.get_feature_names()`
- Configure nox to run tests with both scikit-learn 0.23 and 1.0

falcaopetri mentioned this pull request Oct 17, 2021

Column naming: compatibility with OneHotEncoder #241

Open

This was linked to issues Oct 18, 2021

Preserving column names when transformer requires multiple columns as input #174

Open

Column naming: compatibility with OneHotEncoder #241

Open

ragrawal reviewed Oct 18, 2021


ragrawal added 2 commits August 7, 2022 19:42

Included input_features as one of the parameter

bc5a5f4

* Changed the order to use get_feature_names_out as the first option, otherwise fall back on classes_
* Passing input_features

add author name

10eebae

ragrawal changed the title ~~[WIP] Use new transformer.get_feature_names_out function~~ Use new transformer.get_feature_names_out function Aug 8, 2022

ragrawal added 5 commits August 7, 2022 20:25

fixing column names

2a184fb

fixed column output names

1a43a5a

removed poetry

a76b0b6

removed tool versions

c6457e0

fixed lint

71e6c62

ragrawal added 5 commits August 7, 2022 21:01

fixed lint issues

8f0c261

fixed lint issues

5343faa

Merge branch 'master' into rj_fix

1bfe435

reduced number of versions

f38baa3

set minimum version for scikit learn to 1.1.0

65ae376
