Bug when using sklearn's make_column_selector & default=None #259

Open
@StochasticBoris

Hi,

When calling DataFrameMapper with

  1. default = None
  2. the feature selection specified using sklearn's make_column_selector,

the output of transform() is incorrect: it passes through ALL input columns instead of only the ones not selected by any feature, which is the intended behavior for default=None.

First, I have monkeypatched sklearn-pandas to insert some prints into:

    def _unselected_columns(self, X):
        """
        Return list of columns present in X and not selected explicitly in the
        mapper.

        Unselected columns are returned in the order they appear in the
        dataframe to avoid issues with different ordering during default fit
        and transform steps.
        """
        X_columns = list(X.columns)
        
        unselected = [column for column in X_columns if
                column not in self._selected_columns
                and column not in self.drop_cols]
        
        print(f"Selected: {list(self._selected_columns)}")
        print(f"Unselected: {unselected}")

        return unselected

If the columns for the feature are passed as a list of strings (by calling the column selector on X beforehand), everything works as it should:

from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Resolve the selector up front, so the mapper receives a plain list of column names.
categorical_features = make_column_selector(dtype_include=dtype_selection)
categorical_features = categorical_features(X)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

# DebugTransformer (not shown) does nothing: fit() returns self, transform() returns its input X.
mapper = DataFrameMapper(
    [(categorical_features, DebugTransformer())],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns:", list(out.columns))

>>>Selector: categorical_features=['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest'], type(categorical_features)=<class 'list'>
>>>Selected: ['embarked', 'boat', 'home.dest', 'sex', 'ticket', 'name', 'cabin']
>>>Unselected: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'body']

However, if the column selector itself is passed in the feature tuple, the output is incorrect. The selected columns are duplicated in the final output:

from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Pass the selector object itself; it is NOT called on X here.
categorical_features = make_column_selector(dtype_include=dtype_selection)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

# DebugTransformer (not shown) does nothing: fit() returns self, transform() returns its input X.
mapper = DataFrameMapper(
    [(categorical_features, DebugTransformer())],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns:", list(out.columns))

>>>Selector: categorical_features=<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6CB47F0>, type(categorical_features)=<class 'sklearn.compose._column_transformer.make_column_selector'>
>>>Selected: [<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6B0F7C0>]
>>>Unselected: ['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']

This happens because the selected columns are not filtered out in the _unselected_columns method, in this part:

unselected = [column for column in X_columns if
                column not in self._selected_columns   # <-- the set holds the selector object, not column names, so 'not in' is always True.
                and column not in self.drop_cols]

The selector object ends up in self._selected_columns because the _selected_columns property treats it as a single column name and adds it as-is in the line selected_columns.add(columns).
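
The membership problem can be reproduced without sklearn or sklearn-pandas. The sketch below is only an illustration: `FakeSelector` is a hypothetical stand-in for a make_column_selector instance (a callable object), and the column list is a shortened version of the Titanic columns above.

```python
class FakeSelector:
    """Hypothetical stand-in for a make_column_selector instance."""

    def __init__(self, names):
        self.names = set(names)

    def __call__(self, columns):
        # Return the matching column names, in input order.
        return [c for c in columns if c in self.names]


X_columns = ["pclass", "name", "sex", "age"]
selector = FakeSelector(["name", "sex"])

# The situation described in this issue: the set contains the selector
# object itself rather than the names it would select.
selected_columns = {selector}

# No string is ever equal to the selector object, so nothing is filtered out.
unselected = [c for c in X_columns if c not in selected_columns]
print(unselected)  # ['pclass', 'name', 'sex', 'age']
```

Every column is reported as unselected, which matches the "Unselected" printout in the failing run.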

To solve this, handling for sklearn's make_column_selector should be added either in _unselected_columns or in the _selected_columns property.
A temporary workaround for others who encounter this: call the column selector on your input dataframe yourself, and pass only the resulting list of column names to the features parameter.
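
One possible shape for the fix, as a rough sketch and not the library's actual code: resolve any callable in the feature tuples against X before collecting names. `resolve_selected_columns` is a hypothetical helper, and the dict `X` and the lambda selector below are stand-ins so the example runs without pandas or sklearn.

```python
def resolve_selected_columns(features, X):
    """Hypothetical helper: collect selected column names from feature tuples.

    Callables (such as a make_column_selector instance) are called on X
    first, so the result contains column names, never selector objects.
    """
    selected = set()
    for feature in features:
        columns = feature[0]
        if callable(columns):
            columns = columns(X)  # resolve the selector to a list of names
        if isinstance(columns, list):
            selected.update(columns)
        else:
            selected.add(columns)  # a single column name
    return selected


# Stand-in for a dataframe: maps column name -> dtype name.
X = {"pclass": "int64", "name": "object", "sex": "category", "age": "float64"}
by_dtype = lambda frame: [c for c, d in frame.items() if d in ("object", "category")]

features = [(by_dtype, None), ("age", None)]
selected = resolve_selected_columns(features, X)
unselected = [c for c in X if c not in selected]
print(sorted(selected))  # ['age', 'name', 'sex']
print(unselected)        # ['pclass']
```

Because the property currently takes no X argument, a real patch would probably need to do this resolution inside _unselected_columns, or cache the resolved names during fit; that choice is left open here.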
