Bug when using sklearn's make_column_selector & default=None #259

Open
StochasticBoris opened this issue Oct 13, 2022 · 0 comments

Hi,

when calling DataFrameMapper with

  1. default = None
  2. the feature selection specified using sklearn's make_column_selector,

the output of transform() is incorrect: it passes through ALL columns rather than only the ones not covered by a transformation, as intended.

To investigate, I monkeypatched sklearn-pandas to insert some prints into _unselected_columns:

    def _unselected_columns(self, X):
        """
        Return list of columns present in X and not selected explicitly in the
        mapper.

        Unselected columns are returned in the order they appear in the
        dataframe to avoid issues with different ordering during default fit
        and transform steps.
        """
        X_columns = list(X.columns)
        
        unselected = [column for column in X_columns if
                column not in self._selected_columns
                and column not in self.drop_cols]
        
        print(f"Selected: {list(self._selected_columns)}")
        print(f"Unselected: {unselected}")

        return unselected

First, if the columns for the feature are passed as a list of strings (by resolving the column_selector directly), everything works as it should:

from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Resolve the selector up front so the mapper receives a plain list of column names.
categorical_features = make_column_selector(dtype_include=dtype_selection)
categorical_features = categorical_features(X)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

mapper = DataFrameMapper(
    [(
        categorical_features,
        DebugTransformer(),  # Does nothing: fit() returns self, transform() returns its input X
    )],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns: ", list(out.columns))

>>>Selector: categorical_features=['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest'], type(categorical_features)=<class 'list'>
>>>Selected: ['embarked', 'boat', 'home.dest', 'sex', 'ticket', 'name', 'cabin']
>>>Unselected: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
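
DebugTransformer is not defined in the snippets. Going only by the inline comment, a minimal sketch of it might look like the following. This is an assumption about its shape, not the class actually used in the report; for use with other sklearn tooling, subclassing sklearn.base.BaseEstimator and TransformerMixin is the usual convention.

```python
class DebugTransformer:
    """Pass-through transformer (assumed shape; not shown in the original report)."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Return the input unchanged.
        return X
```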

However, if the column_selector itself is passed in the features tuple, the output is incorrect: the selected columns are duplicated in the final output.

from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Pass the selector object itself, without resolving it against X first.
categorical_features = make_column_selector(dtype_include=dtype_selection)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

mapper = DataFrameMapper(
    [(
        categorical_features,
        DebugTransformer(),  # Does nothing: fit() returns self, transform() returns its input X
    )],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns: ", list(out.columns))

>>>Selector: categorical_features=<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6CB47F0>, type(categorical_features)=<class 'sklearn.compose._column_transformer.make_column_selector'>
>>>Selected: [<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6B0F7C0>]
>>>Unselected: ['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']

This happens because the selected columns are not filtered out properly in the _unselected_columns method, in this part:

unselected = [column for column in X_columns if
                column not in self._selected_columns   # <-- this set holds the selector object, not column names, so 'not in' is always True
                and column not in self.drop_cols]

where the selector object itself is added as if it were a column name by the _selected_columns property, in the line selected_columns.add(columns).
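
The membership problem can be reproduced without sklearn-pandas. In the sketch below, FakeSelector is a hypothetical stand-in for the object returned by make_column_selector, and the column names are a shortened version of the titanic ones. It only illustrates the failing check; it is not the library's code.

```python
class FakeSelector:
    """Hypothetical stand-in for a make_column_selector instance."""

    def __call__(self, df):
        return ["name", "sex"]


# The set ends up holding the selector object, not the names it would resolve to.
selected_columns = {FakeSelector()}
x_columns = ["pclass", "name", "sex", "age"]

# No string is ever equal to the selector object, so nothing is filtered out.
unselected = [column for column in x_columns if column not in selected_columns]
print(unselected)  # ['pclass', 'name', 'sex', 'age']
```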

To solve this, add handling for sklearn's column_selector either within _unselected_columns or in the _selected_columns property.
A temporary workaround for others who encounter this: call the column_selector on your input dataframe yourself, then pass the resulting list of column names to the features parameter.
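
The handling suggested above could look roughly like this. It is a standalone sketch, not a patch against sklearn-pandas: resolve_selected_columns and unselected_columns are hypothetical helper names, and the lambda stands in for a real make_column_selector instance. The idea is to call any callable feature spec on X before collecting column names.

```python
def resolve_selected_columns(feature_columns, X):
    """Return the set of column names selected by a list of feature specs.

    Each spec may be a column name, a list of names, or a callable
    (such as a make_column_selector instance) that maps X to a list of names.
    """
    selected = set()
    for columns in feature_columns:
        if callable(columns):
            # Resolve the selector against the data first.
            columns = columns(X)
        if isinstance(columns, (list, tuple)):
            selected.update(columns)
        else:
            selected.add(columns)
    return selected


def unselected_columns(X_columns, selected, drop_cols=()):
    """Keep the original column order, skipping selected and dropped columns."""
    return [c for c in X_columns if c not in selected and c not in drop_cols]


columns = ["pclass", "name", "sex", "age"]
selector = lambda X: ["name", "sex"]  # stand-in for a column selector
selected = resolve_selected_columns([selector, "age"], columns)
print(unselected_columns(columns, selected))  # ['pclass']
```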
