Use new transformer.get_feature_names_out function #248

Open · falcaopetri wants to merge 13 commits into master
Conversation

falcaopetri

Transformers' get_feature_names is being deprecated in favor of get_feature_names_out. It will be removed in sklearn 1.2 (see the sklearn v1.0 changelog, scikit-learn/scikit-learn#18444, and, for example, OneHotEncoder.get_feature_names).

This PR:

  • Prefers estimator.get_feature_names_out() over estimator.get_feature_names() (a minimal sketch of this fallback is shown below)
  • Configures nox to run tests with both scikit-learn 0.23 and 1.0
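A rough sketch of the preference/fallback logic described above (the helper name resolve_feature_names is hypothetical, not the PR's actual implementation):

def resolve_feature_names(estimator, input_features):
    # Prefer the new API introduced in sklearn 1.0.
    if hasattr(estimator, "get_feature_names_out"):
        return list(estimator.get_feature_names_out(input_features))
    # Fall back to the API deprecated in 1.0 and removed in 1.2.
    if hasattr(estimator, "get_feature_names"):
        return list(estimator.get_feature_names())
    # No names available; the caller generates numeric suffixes instead.
    return None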

The change currently breaks the README#Dynamic Columns example. This happens because there is no StandardScaler.get_feature_names in either sklearn 0.23 or 1.0:

  • The current example's transformed_names_ is ['x_0', 'x_1', 'x_2', 'x_3', 'petal_0', 'petal_1'].
  • In sklearn 1.0, though, there is a StandardScaler.get_feature_names_out, which is used in this PR and therefore produces the output ['x_x0', 'x_x1', 'x_x2', 'x_x3', 'petal_0', 'petal_1'], as illustrated below.
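For context, the 'x_x0' names come from what StandardScaler.get_feature_names_out returns in sklearn 1.0 when the scaler was fit on unnamed (numpy) input; a rough illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.zeros((3, 4)))
# With no column names available, sklearn falls back to generic names,
# which the mapper then prefixes with the feature name 'x' from the
# README example, giving 'x_x0', 'x_x1', etc.
print(scaler.get_feature_names_out())  # ['x0' 'x1' 'x2' 'x3']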

- Prefer `estimator.get_feature_names_out()` over `estimator.get_feature_names()`
- Configure nox to run tests with both scikit-learn 0.23 and 1.0
@ragrawal
Collaborator

FYI: it seems sklearn >= 1.0 requires Python >= 3.7.
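For reference, running the test suite against both sklearn versions in nox could look roughly like this (the session name, Python versions, and pinned releases are assumptions, not necessarily the PR's actual noxfile):

import nox

# sklearn >= 1.0 needs Python >= 3.7, so only 3.7+ interpreters are listed.
@nox.session(python=["3.7", "3.8"])
@nox.parametrize("sklearn", ["0.23.2", "1.0"])
def tests(session, sklearn):
    session.install(f"scikit-learn=={sklearn}")
    session.install("-e", ".")
    session.run("pytest")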

@falcaopetri
Author

One thing I noted is that the current implementation is already a bit inconsistent within sklearn 0.23:

import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
mapper = DataFrameMapper([
    (['col1', 'col2'], sklearn.preprocessing.StandardScaler()),
    (['col1', 'col2'], sklearn.preprocessing.OneHotEncoder()),
], df_out=True)
print(mapper.fit_transform(df).columns)

With sklearn 0.23 or 1.0, the output is:

Index(['col1_col2_0', 'col1_col2_1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

Note that the StandardScaler columns are named {name}_{i}, while the OneHotEncoder columns are named {name}_{estimator.get_feature_names()}.

Meanwhile, sklearn 1.0 plus this PR outputs:

Index(['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

(but none of these column names are very helpful, as discussed in #174)
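To see where the two naming schemes in the sklearn 0.23 output come from, one can query the fitted transformers directly. A rough sketch continuing the snippet above, assuming sklearn 0.23 and an already-fit mapper (built_features is the internal list of (columns, transformer, options) tuples the mapper builds on fit):

(_, scaler, _), (_, ohe, _) = mapper.built_features
# StandardScaler has no get_feature_names in 0.23, so the mapper falls
# back to numeric suffixes ('col1_col2_0', 'col1_col2_1').
print(hasattr(scaler, "get_feature_names"))  # False
# OneHotEncoder does have it, hence the 'x0_0', 'x0_1', ... suffixes.
print(ohe.get_feature_names())  # ['x0_0' 'x0_1' 'x0_2' 'x0_3' 'x1_0' ...]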

@StochasticBoris

The latest versions of scikit-learn (1.1+) have significantly improved the coverage of transformers that implement get_feature_names_out (#21308 - Implement get_feature_names_out for all estimators). Is there any possibility of revisiting this issue? The current naming behaviour of DataFrameMapper is still not working correctly, as demonstrated by @falcaopetri above.

Having correct output names in a pipeline of sequential mappers is crucial, and it gets out of hand quickly when there are multiple columns in the dataset. The problem is exacerbated with transformers that operate non-independently on multiple columns (such as PolynomialFeatures, which generates interaction features), since this prohibits the use of gen_features (which otherwise handles column naming better, i.e. without listing all columns for each feature); see the example below:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
mapper = DataFrameMapper([
    (['col1', 'col2'], poly),
], df_out=True)

print(mapper.fit_transform(df).columns)
>>> FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
>>> ['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0^2', 'col1_col2_x0 x1', 'col1_col2_x1^2']

print(poly.get_feature_names_out(['col1', 'col2']))
>>> ['col1' 'col2' 'col1^2' 'col1 col2' 'col2^2']

@ragrawal
Collaborator

ragrawal commented Aug 7, 2022

Hi, let me review the PR this week and merge it. I think it will be a major release, as we will break some of the existing functionality.

@falcaopetri
Author

Hi, everyone. Please let me know if there's anything I can do from my end.

@hu-minghao

hu-minghao commented Aug 7, 2022 via email

* Changed the order to use get_feature_names_out as the first option; otherwise fall back on classes_
* Passing input_features
@ragrawal ragrawal changed the title [WIP] Use new transformer.get_feature_names_out function Use new transformer.get_feature_names_out function Aug 8, 2022
@ragrawal
Collaborator

ragrawal commented Aug 8, 2022

Hi @falcaopetri -- thanks for your contribution. I made a few changes to your PR, but I think we need to rethink the whole alias/prefix/suffix handling. If you are available, we can have a quick chat and discuss how to handle it.

A few updates: it seems there is a difference in get_feature_names_out between 1.1.0 and 1.1.2 for sklearn.decomposition.PCA.

@hu-minghao

hu-minghao commented Oct 11, 2022 via email

@bryant1410

Here is a workaround that works for me (a usage sketch follows the function):

import itertools
from collections.abc import Iterable

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn_pandas import DataFrameMapper


def _fix_column_names(df: pd.DataFrame, mapper: DataFrameMapper) -> pd.DataFrame:
    for columns, transformer, kwargs in mapper.built_features:
        if (isinstance(transformer, OneHotEncoder)
                or (isinstance(transformer, Pipeline) and any(isinstance(t, OneHotEncoder) for t in transformer))):
            assert isinstance(columns, Iterable) and not isinstance(columns, str)

            new_names = transformer.get_feature_names_out(columns)

            old_name_prefix = kwargs.get("alias", "_".join(str(c) for c in columns))
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))
        elif isinstance(transformer, Pipeline) and isinstance(transformer[0], MultiLabelBinarizer):
            # The way sklearn-pandas infers the names is by iterating the transformers and getting the names and trying
            # to get the features names that are available from the last one that has them. Then, it checks if their
            # length matches the output number of features. However, if the binarizer is followed by feature selection,
            # this process fails as the previous condition is not met. So we handle it manually here.
            assert isinstance(columns, str)

            # `MultiLabelBinarizer` doesn't implement `get_feature_names_out`.
            new_names = [f"{columns}_{c}" for c in transformer[0].classes_]

            # We slice as an iterator and not by passing a slice to `__getitem__` because if the transformer is of type
            # `TransformerPipeline` then it fails.
            for t in itertools.islice(transformer, 1, None):
                new_names = t.get_feature_names_out(new_names)

            old_name_prefix = kwargs.get("alias", columns)
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))

    return df
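A usage sketch, assuming a DataFrameMapper like the ones in the earlier examples has already been defined (the variable names here are just placeholders):

# Fit and transform as usual, then rename the generated columns in place.
df_out = mapper.fit_transform(df)
df_out = _fix_column_names(df_out, mapper)
print(df_out.columns)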

@hu-minghao

hu-minghao commented Apr 26, 2023 via email
