
Making DataFrameMapper compatible with GridSearchCV #170

Open

wants to merge 5 commits into base: master

Conversation

@devforfu (Collaborator) commented Sep 5, 2018

This PR attempts to implement the proposal from issue #159. The idea is to write custom get_params and set_params methods that are compatible with scikit-learn grid search objects.

The following snippet shows the supported features:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

mapper = DataFrameMapper([
    (['colA'], StandardScaler()),
    (['colB'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()]),
])
pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', SVC(kernel='linear'))
])

# 1. the pipeline parameters include parameters of the nested transformers
parameters = pipeline.get_params()
assert 'mapper__colA__with_mean' in parameters
assert 'mapper__colA__with_std' in parameters
assert 'mapper__colB__with_mean' in parameters
assert 'mapper__colB__with_std' in parameters

# 2. the parameters of nested transformers can be set from the outside
pipeline.set_params(
    mapper__colA__with_mean=True,
    mapper__colB__with_std=False
)

# 3. getting parameters from list of transformers
assert 'mapper__colC__standardscaler__with_mean' in parameters
assert 'mapper__colC__functiontransformer__func' in parameters

# 4. setting parameters to list of transformers
pipeline.set_params(
    mapper__colC__standardscaler__with_mean=True,
    mapper__colC__functiontransformer__func=np.log1p
)

# 5. grid search with parameters
param_grid = dict(
    mapper__colA__with_mean=[True, False],
    mapper__colB__with_std=[True, False],
    mapper__colC__functiontransformer__func=[np.log1p, np.exp, None]
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
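
The flattening behind this naming scheme can be sketched independently of the PR's actual implementation. The function below (`flatten_transformer_params` is a hypothetical name, not part of this PR) shows the convention the snippet above relies on: single transformers contribute bare parameter names, while transformer lists prefix each parameter with the lowercased class name.

```python
from sklearn.preprocessing import StandardScaler, FunctionTransformer

def flatten_transformer_params(features):
    """Flatten (columns, transformer) pairs into grid-search parameter names."""
    params = {}
    for columns, transformers in features:
        # A list of columns is keyed by its first column name in this sketch.
        prefix = columns[0] if isinstance(columns, list) else columns
        if not isinstance(transformers, list):
            transformers = [transformers]
        # Only lists of transformers need the class name to disambiguate.
        named = len(transformers) > 1
        for t in transformers:
            for key, value in t.get_params().items():
                if named:
                    params[f'{prefix}__{type(t).__name__.lower()}__{key}'] = value
                else:
                    params[f'{prefix}__{key}'] = value
    return params

features = [
    (['colA'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()]),
]
params = flatten_transformer_params(features)
assert 'colA__with_mean' in params
assert 'colC__standardscaler__with_mean' in params
assert 'colC__functiontransformer__func' in params
```

The DataFrameMapper would then only need to prepend its own step name (`mapper__`) when the pipeline asks for deep parameters.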

We still need to add more tests and think about possible edge cases. For example, I am not sure how to handle this case:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=1))])
pipeline = Pipeline([
    ('mapper', mapper_fs),
    ('classifier', SVC(kernel='linear'))
])
# how to handle transformers with several columns?
pipeline.set_params(...)

Also, I think the current implementation of set_params could be revised and optimized. For the cases where every transformed column has only a single transformer, we could delegate get_params and set_params to the Pipeline class instead of writing custom code.
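
As a sketch of that last idea (assuming nothing about this PR's internals): wrapping each column's transformer list in a scikit-learn Pipeline would yield the nested parameter names for free.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# make_pipeline names steps after the lowercased class names, so the
# resulting parameter paths match the colC convention shown above.
per_column = make_pipeline(StandardScaler(), FunctionTransformer())
params = per_column.get_params(deep=True)
assert 'standardscaler__with_mean' in params
assert 'functiontransformer__func' in params
```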

I would be glad to hear your thoughts and proposals on finalizing this PR and making DataFrameMapper grid-search ready.

@prasoon2211

Sorry to revive this, but is there any chance of it being merged, @devforfu? This would be a really nice feature to have.

@hu-minghao

hu-minghao commented Aug 27, 2022 via email

@naveen-marthala

naveen-marthala commented Aug 28, 2022

@devforfu, thanks for the work.

I have copied all the code from the latest commit of your fork and tried it with scikit-learn==1.0.2.

This is how the parameters look when multiple column names are used:

>>> print(pipe_6_dbg5.get_params(deep=True).keys())
dict_keys(['default', 'df_out', 'features', 'input_df', 'sparse', "['AveRooms', 'AveBedrms', 'Population']__degree", "['AveRooms', 'AveBedrms', 'Population']__include_bias", "['AveRooms', 'AveBedrms', 'Population']__interaction_only", "['AveRooms', 'AveBedrms', 'Population']__order", "['AveRooms', 'AveBedrms', 'Population']__copy", "['AveRooms', 'AveBedrms', 'Population']__with_mean", "['AveRooms', 'AveBedrms', 'Population']__with_std", "['AveOccup', 'HouseAge']__copy", "['AveOccup', 'HouseAge']__norm"])

A reproducible example:

## getting data
import pandas as pd
from sklearn.datasets import fetch_california_housing

cal_house = fetch_california_housing(as_frame=True)
cal_house = pd.merge(left=cal_house['data'], right=cal_house['target'],
                     left_index=True, right_index=True)

## making pipeline
from sklearn import preprocessing

## `DataFrameMapper` code from https://github.com/devforfu/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py
pipe_6_dbg5 = DataFrameMapper(features=[
    (['AveRooms', 'AveBedrms', 'Population'],
     preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    (['AveRooms', 'AveBedrms', 'Population'], preprocessing.StandardScaler()),
    (['AveOccup', 'HouseAge'], preprocessing.Normalizer()),
], default=None, df_out=True, input_df=True)

pipe_6_dbg5.fit(X=cal_house.drop(columns='MedHouseVal'),
                y=cal_house.loc[:, 'MedHouseVal'])
print(pipe_6_dbg5.get_params(deep=True).keys())

The params were similar (with square brackets and single quotes in the param names) when I subclassed DataFrameMapper from the latest version of sklearn-pandas and overrode get_params and set_params with your code.

It would be a lot more useful, intuitive, and practical if sklearn_pandas.DataFrameMapper also accepted an explicit name for each transformer, as sklearn.compose.ColumnTransformer does.

Also, your code uses the column names to name the parameters. Setting params gets unintuitive when the same column is used in multiple transformer steps. In my example code above, if both the StandardScaler and the Normalizer used the same set of column names, setting a parameter like copy would be ambiguous.
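
For comparison, ColumnTransformer avoids this collision entirely by keying parameters on an explicit step name rather than on the columns, so two steps over the same columns stay distinguishable:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, Normalizer

# Both steps reuse the same columns, but the explicit names ('scale',
# 'norm') keep the parameter paths unambiguous.
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['AveOccup', 'HouseAge']),
    ('norm', Normalizer(), ['AveOccup', 'HouseAge']),
])
params = ct.get_params(deep=True)
assert 'scale__with_mean' in params
assert 'norm__copy' in params
```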
