
Making DataFrameMapper compatible with GridSearchCV #170

Open

wants to merge 5 commits into base: master

Conversation

@devforfu (Collaborator) commented Sep 5, 2018

This PR attempts to implement the proposal from issue #159. The idea is to write custom get_params and set_params methods that are compatible with scikit-learn grid search objects.

The following snippet shows the supported features:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

mapper = DataFrameMapper([
    (['colA'], StandardScaler()),
    (['colB'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()]),
])
pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', SVC(kernel='linear'))
])

# 1. the pipeline parameters include parameters of the nested transformers
parameters = pipeline.get_params()
assert 'mapper__colA__with_mean' in parameters
assert 'mapper__colA__with_std' in parameters
assert 'mapper__colB__with_mean' in parameters
assert 'mapper__colB__with_std' in parameters

# 2. the parameters of nested transformers can be set from the outside
pipeline.set_params(
    mapper__colA__with_mean=True,
    mapper__colB__with_std=False
)

# 3. getting parameters from list of transformers
assert 'mapper__colC__standardscaler__with_mean' in parameters
assert 'mapper__colC__functiontransformer__func' in parameters

# 4. setting parameters to list of transformers
pipeline.set_params(
    mapper__colC__standardscaler__with_mean=True,
    mapper__colC__functiontransformer__func=np.log1p
)

# 5. grid search with parameters
param_grid = dict(
    mapper__colA__with_mean=[True, False],
    mapper__colB__with_std=[True, False],
    mapper__colC__functiontransformer__func=[np.log1p, np.exp, None]
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
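
The flattening behind this naming scheme can be sketched independently of the PR's actual implementation. The function below (`flatten_transformer_params` is a hypothetical name, not part of this PR) shows the convention the snippet above relies on: single transformers contribute bare parameter names, while transformer lists prefix each parameter with the lowercased class name.

```python
from sklearn.preprocessing import StandardScaler, FunctionTransformer

def flatten_transformer_params(features):
    """Flatten (columns, transformer) pairs into grid-search parameter names."""
    params = {}
    for columns, transformers in features:
        # A list of columns is keyed by its first column name in this sketch.
        prefix = columns[0] if isinstance(columns, list) else columns
        if not isinstance(transformers, list):
            transformers = [transformers]
        # Only lists of transformers need the class name to disambiguate.
        named = len(transformers) > 1
        for t in transformers:
            for key, value in t.get_params().items():
                if named:
                    params[f'{prefix}__{type(t).__name__.lower()}__{key}'] = value
                else:
                    params[f'{prefix}__{key}'] = value
    return params

features = [
    (['colA'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()]),
]
params = flatten_transformer_params(features)
assert 'colA__with_mean' in params
assert 'colC__standardscaler__with_mean' in params
assert 'colC__functiontransformer__func' in params
```

The DataFrameMapper would then only need to prepend its own step name (`mapper__`) when the pipeline asks for deep parameters.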

We still need to add more tests and think about possible edge cases. For example, I am not sure how to handle this case:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=1))])
pipeline = Pipeline([
    ('mapper', mapper_fs),
    ('classifier', SVC(kernel='linear'))
])
# how to handle transformers with several columns?
pipeline.set_params(...)

Also, I think the current implementation of set_params could be revised and optimized. For the cases where every transformed column has only a single transformer, we could delegate get_params and set_params to the Pipeline class instead of writing custom code.
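
As a sketch of that last idea (assuming nothing about this PR's internals): wrapping each column's transformer list in a scikit-learn Pipeline would yield the nested parameter names for free.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# make_pipeline names steps after the lowercased class names, so the
# resulting parameter paths match the colC convention shown above.
per_column = make_pipeline(StandardScaler(), FunctionTransformer())
params = per_column.get_params(deep=True)
assert 'standardscaler__with_mean' in params
assert 'functiontransformer__func' in params
```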

I would be glad to hear your thoughts and proposals on finalizing this PR and making DataFrameMapper grid-search ready.

@prasoon2211

Sorry to revive this, but is there any chance of it being merged, @devforfu? This would be a really nice feature to have.

@hu-minghao

hu-minghao commented Aug 27, 2022 via email

@naveen-marthala

naveen-marthala commented Aug 28, 2022

@devforfu, thanks for the work.

I have copied all the code from the latest commit of your fork and tried it with scikit-learn==1.0.2.

This is how the parameters look when multiple column names are used:

>>> print(pipe_6_dbg5.get_params(deep=True).keys())
dict_keys(['default', 'df_out', 'features', 'input_df', 'sparse', "['AveRooms', 'AveBedrms', 'Population']__degree", "['AveRooms', 'AveBedrms', 'Population']__include_bias", "['AveRooms', 'AveBedrms', 'Population']__interaction_only", "['AveRooms', 'AveBedrms', 'Population']__order", "['AveRooms', 'AveBedrms', 'Population']__copy", "['AveRooms', 'AveBedrms', 'Population']__with_mean", "['AveRooms', 'AveBedrms', 'Population']__with_std", "['AveOccup', 'HouseAge']__copy", "['AveOccup', 'HouseAge']__norm"])

A reproducible example:

## getting data
import pandas as pd
from sklearn.datasets import fetch_california_housing

cal_house = fetch_california_housing(as_frame=True)
cal_house = pd.merge(left=cal_house['data'], right=cal_house['target'],
                     left_index=True, right_index=True)

## making pipeline
from sklearn import preprocessing

## `DataFrameMapper` code from https://github.com/devforfu/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py
pipe_6_dbg5 = DataFrameMapper(features=[
    (['AveRooms', 'AveBedrms', 'Population'],
     preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    (['AveRooms', 'AveBedrms', 'Population'], preprocessing.StandardScaler()),
    (['AveOccup', 'HouseAge'], preprocessing.Normalizer()),
], default=None, df_out=True, input_df=True)

pipe_6_dbg5.fit(X=cal_house.drop(columns='MedHouseVal'),
                y=cal_house.loc[:, 'MedHouseVal'])
print(pipe_6_dbg5.get_params(deep=True).keys())

The params were similar (with square brackets and single quotes in the param names) when I subclassed DataFrameMapper from the latest version of sklearn-pandas and overrode get_params and set_params with your code.

It would be a lot more useful, intuitive, and practical if sklearn_pandas.DataFrameMapper also accepted an explicit name for each transformer, as sklearn.compose.ColumnTransformer does.

Also, your code uses the column names to name the parameters. Setting params gets unintuitive when the same column is used in multiple transformer steps. In my example code above, if both the StandardScaler and the Normalizer used the same set of column names, setting a parameter like copy would be ambiguous.
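
For comparison, ColumnTransformer avoids this collision entirely by keying parameters on an explicit step name rather than on the columns, so two steps over the same columns stay distinguishable:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, Normalizer

# Both steps reuse the same columns, but the explicit names ('scale',
# 'norm') keep the parameter paths unambiguous.
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['AveOccup', 'HouseAge']),
    ('norm', Normalizer(), ['AveOccup', 'HouseAge']),
])
params = ct.get_params(deep=True)
assert 'scale__with_mean' in params
assert 'norm__copy' in params
```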
