Use new transformer.get_feature_names_out function #248
- Prefer `estimator.get_feature_names_out()` over `estimator.get_feature_names()`
- Configure nox to run tests with both scikit-learn 0.23 and 1.0
FYI: it seems sklearn >= 1.0 requires Python >= 3.7.
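For readers unfamiliar with nox, the two-version test setup mentioned in the commit message could look roughly like the sketch below. This is not the noxfile from the PR: the session name `tests`, the exact version pins, the use of `pytest`, and installing the package with `.` are all assumptions made for illustration.

```python
# noxfile.py (illustrative sketch, not the file from this PR)
import nox


@nox.session
@nox.parametrize("sklearn", ["0.23.2", "1.0"])  # assumed pins for the two targets
def tests(session, sklearn):
    # Install the pinned scikit-learn first so the package under test
    # is exercised against that exact version.
    session.install(f"scikit-learn=={sklearn}")
    # Assumed: pytest is the test runner and the project installs from ".".
    session.install("pytest", ".")
    session.run("pytest")
```

Since sklearn >= 1.0 needs Python >= 3.7, a real configuration would probably also have to restrict which interpreters each pin runs on.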
One thing I noted is that the current implementation is already a little inconsistent within sklearn 0.23:

import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper
df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
mapper = DataFrameMapper([
    (['col1', 'col2'], sklearn.preprocessing.StandardScaler()),
    (['col1', 'col2'], sklearn.preprocessing.OneHotEncoder()),
], df_out=True)
print(mapper.fit_transform(df).columns)

With sklearn 0.23 or 1.0, the output is:
Note that
Meanwhile, sklearn 1.0 + this PR outputs:
(but all these column names are not very helpful, as discussed in #174)
The latest versions of scikit-learn (1.1+) have improved the coverage of Transformers that implement `get_feature_names_out`. Having correct output names in a pipeline of sequential mappers is crucial, and the naming gets out of hand quickly when there are multiple columns in the dataset. The problem is exacerbated with `PolynomialFeatures`:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
mapper = DataFrameMapper([
    (['col1', 'col2'], poly),
], df_out=True)
print(mapper.fit_transform(df).columns)
>>> FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
    warnings.warn(msg, category=FutureWarning)
>>> ['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0^2', 'col1_col2_x0 x1', 'col1_col2_x1^2']
print(poly.get_feature_names_out(['col1', 'col2']))
>>> ['col1' 'col2' 'col1^2' 'col1 col2' 'col2^2']
Hi, let me review the PR this week and merge it. I think this will need to be a major release, since it breaks some existing functionality.
Hi, everyone. Please let me know if there's anything I can do from my end.
Hello, your message has been received. Thank you.
* Changed the order to use `get_feature_names_out` as the first option; otherwise fall back on `classes_`
* Passing `input_features`
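The lookup order described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: `resolve_feature_names`, `FakeEncoder`, and `FakeBinarizer` are hypothetical names, and the stand-in classes only mimic the duck-typed attributes that a real transformer would expose.

```python
from typing import List, Optional, Sequence


def resolve_feature_names(transformer, input_features: Sequence[str],
                          n_outputs: int) -> Optional[List[str]]:
    """Return output names for a fitted transformer, or None if unknown.

    Hypothetical helper for illustration; not part of sklearn-pandas.
    """
    if hasattr(transformer, "get_feature_names_out"):
        # First option: the newer API, called with the input column names.
        raw = transformer.get_feature_names_out(list(input_features))
        names = [str(n) for n in raw]
    elif hasattr(transformer, "classes_"):
        # Fallback: label-style transformers that only expose classes_.
        names = [str(c) for c in transformer.classes_]
    else:
        return None
    # Only trust the names if their count matches the output width.
    return names if len(names) == n_outputs else None


class FakeEncoder:
    """Stand-in exposing get_feature_names_out (assumed shape of the API)."""

    def get_feature_names_out(self, input_features=None):
        return [f"{input_features[0]}_{v}" for v in ("a", "b")]


class FakeBinarizer:
    """Stand-in exposing only classes_."""

    classes_ = ["red", "green"]


print(resolve_feature_names(FakeEncoder(), ["col1"], 2))
print(resolve_feature_names(FakeBinarizer(), ["col1"], 2))
print(resolve_feature_names(object(), ["col1"], 2))
```

Returning `None` when nothing matches leaves the caller free to fall back to positional names such as `col1_0`, `col1_1`.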
Hi @falcaopetri, thanks for your contribution. I made a few changes to your PR, but I think we need to rethink the whole alias/prefix/suffix handling. If you are available, we can have a quick chat and discuss how to handle it. A few updates: it seems there is a difference in get_feature_names_out between 1.1.0 and 1.1.2 for
Here is a workaround that works for me:

import itertools
from collections.abc import Iterable

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn_pandas import DataFrameMapper


def _fix_column_names(df: pd.DataFrame, mapper: DataFrameMapper) -> pd.DataFrame:
    for columns, transformer, kwargs in mapper.built_features:
        if (isinstance(transformer, OneHotEncoder)
                or (isinstance(transformer, Pipeline) and any(isinstance(t, OneHotEncoder) for t in transformer))):
            assert isinstance(columns, Iterable) and not isinstance(columns, str)
            new_names = transformer.get_feature_names_out(columns)
            old_name_prefix = kwargs.get("alias", "_".join(str(c) for c in columns))
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]
            df = df.rename(columns=dict(zip(old_names, new_names)))
        elif isinstance(transformer, Pipeline) and isinstance(transformer[0], MultiLabelBinarizer):
            # sklearn-pandas infers the names by iterating over the transformers and taking the feature names
            # from the last one that provides them. It then checks that their length matches the number of
            # output features. If the binarizer is followed by feature selection, that check fails and the
            # inference breaks, so we handle this case manually here.
            assert isinstance(columns, str)
            # `MultiLabelBinarizer` doesn't implement `get_feature_names_out`.
            new_names = [f"{columns}_{c}" for c in transformer[0].classes_]
            # We slice as an iterator instead of passing a slice to `__getitem__` because the latter fails
            # if the transformer is of type `TransformerPipeline`.
            for t in itertools.islice(transformer, 1, None):
                new_names = t.get_feature_names_out(new_names)
            old_name_prefix = kwargs.get("alias", columns)
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]
            df = df.rename(columns=dict(zip(old_names, new_names)))
    return df
Transformer's `get_feature_names` is getting deprecated in favor of `get_feature_names_out`. It will be removed in sklearn 1.2 (see sklearn v1.0 changelog, scikit-learn/scikit-learn#18444, and, for example, OneHotEncoder.get_feature_names).

This PR:
- Prefer `estimator.get_feature_names_out()` over `estimator.get_feature_names()`

The change currently breaks the README#Dynamic Columns example. This happens because there is no `StandardScaler.get_feature_names` in either sklearn 0.23 or 1.0, so `transformed_names_` is `['x_0', 'x_1', 'x_2', 'x_3', 'petal_0', 'petal_1']`. There is, however, a `StandardScaler.get_feature_names_out`, which is used in this PR and therefore produces the output `['x_x0', 'x_x1', 'x_x2', 'x_x3', 'petal_0', 'petal_1']`.